Given a bunch of names, how can we tell which are Chinese names and which are English names? To find the Chinese names, I built a list of Chinese last names. For example, in "Bruce Lee", Lee is a Chinese last name, so we regard Bruce Lee as a Chinese name. However, the list of Chinese last names is large. Is there a better way to do it? If you are not familiar with Chinese names, you can instead explain how you would distinguish English names from other names, such as French or Italian names.
-
Why would Bruce Lee be an exclusively Chinese name? – EdChum Apr 28 '17 at 16:14
-
@EdChum Ummm... Maybe I'm wrong. Can we assume that people with Asian characteristics who use this name are Chinese, ABC (American-born Chinese), CBC (Canadian-born Chinese), or something like that? I don't see white or black people using such a name. – Gang Apr 28 '17 at 16:19
-
I'm a bit confused by the way the question is phrased. Are you A) positing two existing lists, one with English names and one with Chinese names, then checking to see if your variable lies within those lists, or B) attempting to evaluate whether the name is Chinese or English based on characteristics of the provided string? – lysdexia Apr 28 '17 at 16:29
-
@lysdexia Well, actually, we are crawling the websites of the top 100 universities and trying to find all the Chinese professors. Those professors come from all over the world, so how can we recognise which ones are Chinese? – Gang Apr 28 '17 at 16:38
-
@Gang The name of the professor does not necessarily imply his/her nationality. What if you have a name like Bruce David Lee? It's a guessing game. – GIZ Apr 28 '17 at 16:45
-
I mean, they don't necessarily have to be Chinese nationals. We can extend the scope to ancestry. Take Jeremy Lin: I believe his nationality is American, but he is also an Asian American, and he himself doesn't deny being Taiwanese. So it is reasonable to regard this as a Chinese name. – Gang Apr 28 '17 at 16:59
-
This project is ill-conceived. "Bruce Lee" is the actor's professional English name. His real Chinese name is Lee Jun-fan. East Asian family names come first, not last. People in a majority-English-speaking country may or may not reverse the order to make an English name. And by the way, 'Lee' is also an American last name. You only think that calling 'Bruce Lee' Chinese is correct because you associate it with a particular person. Doing the same with 'Robert (E.) Lee' would be a blunder. And there could well be a very American 'Bruce Lee'. – Terry Jan Reedy Apr 28 '17 at 17:21
-
Yeah, that's the problem. Since I'm not familiar with Western culture, I can't be 100% sure that Lee is a Chinese last name. Maybe this cannot be done perfectly. However, how can we improve the accuracy as far as possible? – Gang Apr 28 '17 at 17:30
2 Answers
If you already have lists of typical Chinese and English names and the only problem is performance, I suggest converting the lists into sets and then testing membership in both sets. This is much faster than checking whether an element is present in a large list.
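A minimal sketch of the set-based lookup, assuming the last whitespace-separated token is the surname. The sample surnames and the `classify_surname` helper are made up for illustration; real lists would be much larger. Note that a name such as "Lee" can appear in both sets, so the sketch reports it as ambiguous rather than guessing.

```python
# Hypothetical sample data; real surname lists would be far larger.
chinese_surnames = ["Wang", "Li", "Zhang", "Liu", "Chen", "Lee"]
english_surnames = ["Smith", "Jones", "Taylor", "Brown", "Lee"]

# Build the sets once. Each lookup is then O(1) on average,
# instead of O(n) for a scan through a list.
chinese_set = {s.lower() for s in chinese_surnames}
english_set = {s.lower() for s in english_surnames}


def classify_surname(full_name):
    """Classify by the last token of full_name (assumed to be the surname)."""
    tokens = full_name.split()
    if not tokens:
        return "unknown"
    surname = tokens[-1].lower()
    in_chinese = surname in chinese_set
    in_english = surname in english_set
    if in_chinese and in_english:
        return "ambiguous"
    if in_chinese:
        return "chinese"
    if in_english:
        return "english"
    return "unknown"
```

This only speeds up the lookup; it does nothing for accuracy, which still depends entirely on the quality of the two lists.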

-
This is helpful, but the main problem is accuracy. Actually, I'm not in charge of this part, but I saw my partner run his code. It flagged a name that looks Chinese. I say "looks" because names from mainland China, Hong Kong and Taiwan are spelled a little differently, so I'm not quite sure. However, the profile image is obviously a white guy without any Asian characteristics. – Gang Apr 28 '17 at 16:34
Well, that's a pickle.
If the professors' names were written in Chinese, the obvious answer would be to check each character in the name. This answer gives us a clue that many commonly used Unicode "Chinese" characters are in the range 19968 - 40959 (U+4E00 to U+9FFF).
Thus:
def is_chinese(var):
    # var must be a single character; True if its code point is in 19968..40959 (U+4E00..U+9FFF)
    return 19968 <= ord(var) <= 40959
If your hypothetical Chinese professors have their names written using characters in that range somewhere in their bio, you need only search for a few characters in that range to get a reasonable answer.
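As a rough sketch of that search, the functions below count the characters of a bio that fall in the same range. The names `count_cjk` and `looks_chinese` and the `min_chars` threshold are illustrative assumptions. Bear in mind that this range is shared with Japanese kanji, so a hit does not prove the text is Chinese.

```python
def is_cjk_ideograph(ch):
    """True if the single character ch is in U+4E00..U+9FFF (19968..40959)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def count_cjk(text):
    """Count the characters of text that fall in the range above."""
    return sum(1 for ch in text if is_cjk_ideograph(ch))


def looks_chinese(text, min_chars=2):
    """Heuristic: text contains at least min_chars characters in the range."""
    return count_cjk(text) >= min_chars


print(looks_chinese("Bruce Lee (李振藩), actor"))  # three ideographs in the text
print(looks_chinese("Bruce Lee, actor"))  # no ideographs in the text
```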
However, if you already have a list of Chinese names, @SheepPerplexed has probably supplied the quickest way.
-
No, the websites are all written in English, including the professors' names. – Gang Apr 28 '17 at 17:03
-
Ah. Well, it may be time to get fancy and try a [Bayesian network](https://www.coursera.org/learn/probabilistic-graphical-models). – lysdexia Apr 28 '17 at 17:08
-
One could use [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to check for variants in English spelling of a list of common surnames. – lysdexia Apr 28 '17 at 17:15
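A minimal sketch of the Levenshtein idea, using a plain dynamic-programming implementation so it needs no third-party package. The `COMMON_SURNAMES` list, the `closest_surname` helper, and the `max_distance` threshold are assumptions for illustration only. Surnames are short, so even a distance of 1 can match unrelated names; treat the result as a hint, not a verdict.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical reference list of romanized surnames.
COMMON_SURNAMES = ["wang", "zhang", "chen", "huang", "zhou"]


def closest_surname(name, max_distance=1):
    """Return the nearest reference surname, or None if none is close enough."""
    name = name.lower()
    best = min(COMMON_SURNAMES, key=lambda s: levenshtein(name, s))
    return best if levenshtein(name, best) <= max_distance else None
```

With this sample list, a variant such as "Wong" maps to "wang" and "Chan" maps to "chen", while "Smith" matches nothing.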