I have a csv file which has two columns and about 9,000 rows. Column 1 contains the firstname of a respondent in a survey, column 2 contains the lastname of a respondent in a survey, so each row is an observation.

These surveys were conducted in a very diverse place. I am trying to find a way to tell whether a respondent's first name is of English (British or American) origin or not, and the same for the last name.

This task is far outside my area of expertise. After reading some interesting discussions online (here, and here), I have thought of three approaches:

1- Take a dataset of the most common triplets (groups of 3 letters often found together in English) or quadruplets (groups of 4 letters often found together in English) and check, for each first name and last name, whether it contains these groups of letters.

2- Use a dataset of British names (say, the X most common names in the UK in the early 20th century) and match these names against my dataset based on proximity. I think these datasets could be good: data1, data2, data3.

3- Use Python and a library that detects what is (most likely) English and what is not.
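
To make option 1 concrete, here is a minimal sketch of a trigram check. It assumes the "English" trigrams are taken from a small reference list of English names rather than from a real trigram-frequency dataset; the reference list, the function names, and the `^`/`$` padding characters are all placeholders chosen for illustration.

```python
def trigrams(name):
    """Return the character trigrams of a name, padded with start/end markers."""
    padded = f"^{name.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(reference_names):
    """Collect every trigram seen in a list of reference (assumed English) names."""
    return {gram for name in reference_names for gram in trigrams(name)}


def english_score(name, profile):
    """Share of the name's trigrams that also occur in the reference profile."""
    grams = trigrams(name)
    if not grams:
        return 0.0
    return sum(gram in profile for gram in grams) / len(grams)


# Hypothetical reference list; a real one would hold thousands of names.
reference = ["James", "Brown", "Smith", "Johnson", "Williams"]
profile = build_profile(reference)

score_james = english_score("James", profile)
score_kpali = english_score("Kpali", profile)
```

A name would then be flagged as English when its score exceeds some cutoff. That cutoff is a free parameter and would have to be tuned on a hand-labelled subset of the data.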

If anyone has advice on this or can share their experience, that would be great!

I am attaching an example of the data (I made up the names) and of the expected output.

NB: Please note that I am perfectly aware that classifying names according to an English/Non English dichotomy is not without drawbacks and semantic issues.

Marcel Campion
  • Cool question! I don't know the answer, so I'm bookmarking it to see how you finally solve it. One suggestion would be to extend the approach to also match names against non-English names and use that information as well. – lllrnr101 Jan 30 '21 at 12:04
  • That is a very good suggestion! – Marcel Campion Jan 30 '21 at 13:13

2 Answers

Although the best solution would probably be to train a classification model on top of BERT or a similar language model, a crude solution would be to use zero-shot classification. The example below uses transformers. It does a fairly decent job, although you see some semantic issues pop up: the classification of the name Black, for example, is likely distorted due to it also being a color.

import pandas as pd
from transformers import pipeline

# Example data; replace with pd.read_csv(...) for the real file
data = [['James', 'Brown'], ['Gerhard', 'Schreuder'], ['Musa', 'Bemba'], ['Morris D.', 'Kemba'], ['Evelyne', 'Fontaine'], ['Max D.', 'Kpali Jr.'], ['Musa', 'Black']]
df = pd.DataFrame(data, columns=['firstname', 'name'])

# Loads the default zero-shot classification model (downloaded on first use)
classifier = pipeline("zero-shot-classification")

firstnames = df['firstname'].tolist()
lastnames = df['name'].tolist()
candidate_labels = ["English or American", "not English or American"]
# Each candidate label is substituted into this template
hypothesis_template = "This name is {}."

# One result dict per name; its 'labels' list is sorted by descending score
results_firstnames = classifier(firstnames, candidate_labels, hypothesis_template=hypothesis_template)
results_lastnames = classifier(lastnames, candidate_labels, hypothesis_template=hypothesis_template)

# 1 if the top-scoring label is "English or American", else 0
df['f_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_firstnames]
df['n_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_lastnames]
df

Output:

|    | firstname   | name      |   f_english |   n_english |
|---:|:------------|:----------|------------:|------------:|
|  0 | James       | Brown     |           1 |           1 |
|  1 | Gerhard     | Schreuder |           0 |           0 |
|  2 | Musa        | Bemba     |           0 |           0 |
|  3 | Morris D.   | Kemba     |           1 |           0 |
|  4 | Evelyne     | Fontaine  |           1 |           0 |
|  5 | Max D.      | Kpali Jr. |           1 |           0 |
|  6 | Musa        | Black     |           0 |           0 |
RJ Adriaansen
  • I've updated the code sample so you can now run it in one go. I'm using pandas to process the csv, you can use pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to load the csv like this: `df=pd.read_csv('/Users/marcelcampion/Desktop/names.csv')`; you can check if it has loaded correctly using `df.head()`. If there are no column names visible you can set them like this: `df.columns = ['firstname', 'name']` – RJ Adriaansen Jan 30 '21 at 18:47
  • I have tried to use this code; however, this is what I get. Any idea how to fix it? None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used. Traceback (most recent call last): – Marcel Campion Jan 31 '21 at 10:11
  • It seems to be [this](https://github.com/apple/tensorflow_macos/issues/144) issue, and [this](https://id2thomas.medium.com/apple-silicon-experiment-1-installing-huggingface-transformers-2e45392d3d0f) solution is offered. But these are system-specific issues, not code related. You can always use google colab. – RJ Adriaansen Jan 31 '21 at 10:19
I built something a while back that is quite similar. Summary below.

  1. Created two source lists: a first-name list and a last-name list.
  2. Created 4+ comparison lists (an English first-name list, an English last-name list, etc.).
  3. Used an in_array function to compare each source first name against a comparison first-name list.
  4. Used a big if statement to check the lists against each other: Eng.First vs Src.First, American.First vs Src.First, Irish.First vs Src.First,

and so on. If you are thinking of using your first bullet as an option (i.e. matching on parts and pieces of a name), I wrote a paper, which includes some source code, that may be able to help:

Ordered Match Ratio as a Method for Detecting Program Abuse / Fraud
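
The list comparison in steps 1 to 4 could be sketched roughly as below. This is not the code from the original project, nor the Ordered Match Ratio method from the paper. It is an illustration only: the comparison-list labels, the sample names, and the helper functions are hypothetical, and a dictionary of sets stands in for the in_array calls and the big if statement.

```python
def normalize(name):
    """Lower-case and trim a name so that lookups ignore case and stray spaces."""
    return name.strip().lower()


def matching_lists(name, comparison_lists):
    """Return the labels of every comparison list that contains the name."""
    key = normalize(name)
    return [label for label, names in comparison_lists.items() if key in names]


# Hypothetical comparison lists; real ones would be loaded from name datasets.
first_name_lists = {
    "english_first": {"james", "morris", "max"},
    "american_first": {"james", "brandon"},
    "irish_first": {"liam", "sean"},
}

source_first_names = ["James", " Liam ", "Musa"]

# Map each source first name to the comparison lists it appears in.
matches = {name: matching_lists(name, first_name_lists) for name in source_first_names}
```

A name that matches no list, such as "Musa" here, maps to an empty list and can be flagged for manual review or for one of the fuzzier approaches discussed above.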