I have a CSV file with two columns and about 9,000 rows. Column 1 contains the first name of a survey respondent and column 2 contains the last name, so each row is one observation.
These surveys were conducted in a very diverse place. I am trying to find a way to tell whether a respondent's first name is of English (British or American) origin or not, and likewise for the last name.
This task is very far from my area of expertise. After reading interesting discussions online here and here, I have thought of three approaches:
1- Take a dataset of the most common triplets (groups of 3 letters often found together in English) or quadruplets (groups of 4 letters often found together in English) and check, for each first name and last name, whether it contains these groups of letters.
2- Use a dataset of British names (say, the X most common names in the UK in the early 20th century) and match these names to my dataset based on proximity. I think these datasets could be good: data1, data2, data3.
3- Use Python and an interface to separate what is (most likely) English from what is not.
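To make ideas 1 and 2 more concrete, here is a rough sketch of what I have in mind, using only the Python standard library. The tiny `REFERENCE_NAMES` list, the function names and the 0.8 cutoff are all placeholders I made up for illustration; a real run would load one of the larger name datasets instead, and the thresholds would need tuning.

```python
import difflib

# Placeholder reference list (an assumption for illustration only);
# in practice this would be a large list of common English names.
REFERENCE_NAMES = ["john", "mary", "william", "elizabeth", "smith", "taylor", "brown"]


def trigrams(name):
    """Return the set of 3-letter substrings of a lower-cased name."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


# Idea 1: collect every trigram that occurs in the reference names.
KNOWN_TRIGRAMS = set().union(*(trigrams(n) for n in REFERENCE_NAMES))


def trigram_share(name):
    """Fraction of the name's trigrams that also occur in the reference names."""
    grams = trigrams(name)
    if not grams:
        # Names shorter than 3 letters have no trigrams to compare.
        return 0.0
    return len(grams & KNOWN_TRIGRAMS) / len(grams)


# Idea 2: fuzzy-match the name against the reference list.
def closest_reference(name, cutoff=0.8):
    """Return the most similar reference name, or None if none reaches the cutoff."""
    matches = difflib.get_close_matches(
        name.lower(), REFERENCE_NAMES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

The idea would be to call both functions on each first name and last name in the file and flag a name as likely English when `closest_reference` returns a match or `trigram_share` is above some threshold. I do not know whether this is a sound way to do it, which is part of why I am asking.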
If anyone has advice on this or can share their experience, that would be great!
I am attaching an example of the data (I made up the names) and of the expected output.
NB: Please note that I am perfectly aware that classifying names according to an English/non-English dichotomy is not without drawbacks and semantic issues.