Evaluating Supervised Machine Learning Methods to Predict Ethnicity from Last Names

We investigate the performance of several machine learning (ML) algorithms for classifying a person’s ethnicity based solely on their last name, in order to select the most reliable classifier. Because the input data are categorical (strings of text), they were first pre-processed using indicator variables (known in machine learning as “one-hot encoding”) to transform them into numerical data. The following classification methods were each trained on a subset of the data: logistic regression, naive Bayes, k-NN, and grouped Lasso penalized logistic regression (GLPL). Our results show that k-NN, being a non-parametric method, does not perform well on unbalanced data such as the training data used here. The logistic regression classifier and the GLPL classifier performed best in terms of accuracy, true positive rate, and false positive rate. These findings encourage the use of GLPL for classification with string-text data, particularly when the model is also expected to learn from the entire sequence of characters.
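As a rough illustration of the pre-processing and evaluation steps described above, the following Python sketch one-hot encodes a last name character by character and computes accuracy, true positive rate, and false positive rate for one class. The abstract does not specify the encoding details, so the alphabet, the fixed maximum length `MAX_LEN`, and the function names `one_hot_encode` and `binary_rates` are all assumptions made for illustration; this is not the paper's implementation.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: 26 lowercase letters only
MAX_LEN = 10                       # assumption: names truncated/padded to 10 characters


def one_hot_encode(name, max_len=MAX_LEN, alphabet=ALPHABET):
    """Return a flat 0/1 vector with one block of indicators per character position."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vector = [0] * (max_len * len(alphabet))
    for pos, ch in enumerate(name.lower()[:max_len]):
        if ch in index:
            # Set the indicator for this character within the block of its position.
            vector[pos * len(alphabet) + index[ch]] = 1
        # Characters outside the alphabet leave their block all zeros.
    return vector


def binary_rates(y_true, y_pred, positive):
    """Return (accuracy, true positive rate, false positive rate) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, tpr, fpr


if __name__ == "__main__":
    encoded = one_hot_encode("Li")
    print(len(encoded), sum(encoded))
    print(binary_rates(["a", "a", "b", "b"], ["a", "b", "b", "a"], "a"))
```

Under this assumed layout, each character position contributes its own block of 26 indicator columns. One plausible reading is that such blocks could serve as the groups in a grouped Lasso penalty, though the abstract does not state how the groups were defined.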

Advisor: Dr. Enrique Del Castillo.

Metadata

Work Title: Evaluating Supervised Machine Learning Methods to Predict Ethnicity from Last Names
Access: Open Access
Creators: Ana Gabriela Camargo Sandoval
Keywords: Machine learning, Supervised learning, Classification, Ethnicity, Indicator variables, Confusion matrix
License: CC BY-NC 4.0 (Attribution-NonCommercial)
Work Type: Research Paper
Publication Date: 2023
DOI: 10.26207/q889-9918
Deposited: October 25, 2023

Work History

Version 1
published

  • Created
  • Updated
  • Added Creator Ana Gabriela Camargo Sandoval
  • Added Research_Paper_Final_AnaCamargo.pdf
  • Updated Description, License
  • Published
  • Updated Keyword
  • Updated