Regex to match all non-English alphabetical characters

Question

I'm not a RegEx expert but I'm trying to find a solution that returns TRUE if ANY non-English characters are present (i.e. [^A-Za-z] but like just alphabetical characters, NOT numbers or symbols).

I've tried this:

import re

obj = re.search("[\x00-\x7F]", "ивн")
print(bool(obj))

which returns False but

obj = re.search("[\x00-\x7F]", "ив.н")
print(bool(obj))

returns True which it shouldn't -- I don't care about special characters or punctuation really. Just need a quick solution to see if a text is in another language.

I.e., if there are Cyrillic characters or Umlauts it returns true etc etc for non English scripts, else return false. Other solutions here on StackOverflow simply match non-English characters AND symbols or simply non-ASCII characters. I'm trying to scan a piece of text to see if it's not English basically. I can't find any other answers that work.

So, do the linked threads (on top) answer your questions? If not, does `r'(?![A-Za-z])[^\W\d_]'` solve your problem? — Wiktor Stribiżew, Oct 14 '20 at 08:43
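
The pattern suggested in the comment above can be tried out roughly like this. It is a minimal sketch, assuming Python 3, where `str` patterns are Unicode-aware by default; the helper name `has_non_english_letter` is made up for illustration. `[^\W\d_]` matches a word character that is neither a digit nor an underscore, and the `(?![A-Za-z])` lookahead rules out the ASCII letters, so punctuation, digits, and whitespace never trigger a match.

```python
import re

# Hypothetical helper; only the pattern itself comes from the comment above.
NON_ENGLISH_LETTER = re.compile(r"(?![A-Za-z])[^\W\d_]")

def has_non_english_letter(text):
    """Return True if text contains a letter outside A-Z / a-z."""
    return bool(NON_ENGLISH_LETTER.search(text))

print(has_non_english_letter("ивн"))                # True
print(has_non_english_letter("ив.н"))               # True (Cyrillic letters are present)
print(has_non_english_letter("Hello, world! 123"))  # False (punctuation and digits are ignored)
```

Note that this only tests for letters outside A-Z: accented letters inside otherwise English text (as in "résumé") also match, so it is a script check rather than real language detection.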

score 0 · Answer 1 · answered Oct 13 '20 at 23:21

I can think of something which will do exactly what you're saying, but I'm not sure if it will work for your intended purpose.

Google Translate has a web API for which there is a workaround allowing you to use it for free. By sending a query with the "auto" setting, it will auto detect and respond with the language detected. You might just try doing this with the first few sentences of your text.

One quick and dirty way to do exactly what you're describing is to look at the Unicode code points of the characters in the text and reject it if any of them is above a certain number. (I chose the number by looking at this website.)

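def is_english(string):
    # Treat the text as English only if every character is ASCII (code points 0-127).
    for char in string:
        if ord(char) > 127:
            return False
    # No non-ASCII character was found.
    return True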

This works sometimes:

>>> is_english("Hello")
True
>>> is_english("Российская")
False

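But not all the time. English text containing accented letters is rejected, and other languages written entirely in unaccented Latin letters are accepted: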

>>> is_english("I gave the nice man my résumé")
False
>>> is_english("Stavo andando in parco, il mio cane voleva cagare")
True
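
One way to soften the first failure above is to look only at letters and compare the share of non-ASCII ones against a cutoff, instead of rejecting on a single character. The following is a rough sketch, not a tested language detector: the function names and the `threshold=0.2` default are assumptions chosen for illustration and would need tuning on real data.

```python
def non_ascii_letter_ratio(text):
    """Return the fraction of alphabetic characters that are not ASCII."""
    letters = [ch for ch in text if ch.isalpha()]  # skips digits, punctuation, spaces
    if not letters:
        return 0.0
    non_ascii = sum(1 for ch in letters if ord(ch) > 127)
    return non_ascii / len(letters)

def looks_non_english(text, threshold=0.2):
    # threshold is an arbitrary illustrative cutoff, not a recommended value
    return non_ascii_letter_ratio(text) > threshold

print(looks_non_english("Российская"))  # True
print(looks_non_english("Hello"))       # False
```

With this cutoff a sentence containing one accented word such as "résumé" is no longer flagged, because only a small share of its letters is non-ASCII. The Italian sentence still passes as English, since no character-level check can tell apart languages that share the unaccented Latin alphabet.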

Yeah I've seen that googletrans library but my project is for a research paper and the IRB thinks using that library to skirt Google's API costs is a bit unethical lol (as do I), I'll have to look at some other answers -- I'm just trying to pare down the amount of text I actually DO have to pay Google to translate — Ryhun, Oct 14 '20 at 06:04
