
Evaluating Supervised Machine Learning Methods to Predict Ethnicity from Last Names
We investigate the performance of several machine learning (ML) algorithms for classifying a person’s ethnicity based solely on their last name, in order to select the most reliable classifier. Because the input data are categorical (strings of text), we first pre-processed them using indicator variables (known in machine learning as “one-hot encoding”) to transform them into numerical data. We then trained models on a subset of the data using the following classification methods: logistic regression, naive Bayes, k-NN, and grouped Lasso penalized logistic regression (GLPL). Our results show that k-NN, a non-parametric method, does not perform well on unbalanced data such as the training data used here. The logistic regression classifier and the GLPL classifier performed best in terms of accuracy, true positive rate, and false positive rate. These findings encourage the use of GLPL for classification with string data, particularly when the model should learn from the entire sequence of characters.
Advisor: Dr. Enrique Del Castillo.
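As an illustration of the pre-processing and evaluation steps named in the abstract, the following is a minimal sketch, not the paper's implementation. It assumes names are reduced to the letters a-z and truncated to a fixed length; the names `ALPHABET`, `MAX_LEN`, `one_hot_name`, and `tpr_fpr` are hypothetical and chosen only for this example.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: names are reduced to the letters a-z
MAX_LEN = 8                        # assumption: hypothetical fixed name length


def one_hot_name(name, max_len=MAX_LEN, alphabet=ALPHABET):
    """Encode a last name as 0/1 indicators, one block of len(alphabet) per position."""
    # Keep only characters in the assumed alphabet, then truncate to max_len.
    letters = [c for c in name.lower() if c in alphabet][:max_len]
    vec = [0] * (max_len * len(alphabet))
    for pos, ch in enumerate(letters):
        # Set the indicator for character `ch` inside the block for position `pos`.
        vec[pos * len(alphabet) + alphabet.index(ch)] = 1
    return vec


def tpr_fpr(y_true, y_pred, positive):
    """Return (true positive rate, false positive rate) for one class vs. the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    # Rates are undefined when a class is absent from y_true.
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return tpr, fpr


features = one_hot_name("Li")
print(len(features), sum(features))  # 208 2
```

Each position contributes one block of indicators, so shorter names leave the trailing blocks at zero. Such per-position blocks are one plausible way to define groups for a group penalty, though the grouping actually used in the paper is not stated here.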
Metadata
Work Title | Evaluating Supervised Machine Learning Methods to Predict Ethnicity from Last Names |
---|---|
Access | |
Creators | |
Keyword | |
License | CC BY-NC 4.0 (Attribution-NonCommercial) |
Work Type | Research Paper |
Publication Date | 2023 |
DOI | doi:10.26207/q889-9918 |
Deposited | October 25, 2023 |