Detect strings with non English characters in Python

Question

I have some strings that contain a mix of English and non-English letters. For example:

w='_1991_اف_جي2'

How can I detect such strings using a regex or any other fast method in Python?

I prefer not to compare letters of the string one by one with a list of letters, but to do this in one shot and quickly.

Maybe use the ASCII range, since ASCII contains only English characters, in the range 0-127 I believe — jgr208, Nov 23 '14 at 01:34
Check out [this answer](http://stackoverflow.com/a/8689826/378704). Don't forget to upvote that answer and the question :) — , Nov 23 '14 at 01:40

score 114 · Accepted Answer · edited Feb 16 '19 at 19:28

You can check whether the string can be encoded using only ASCII characters (the Latin alphabet plus digits, punctuation, and control characters). If it cannot, it contains characters from some other alphabet.

Note the comment # -*- coding: ..... In Python 2 it must be at the top of the Python file, otherwise the non-ASCII literals below cause an encoding error. Python 3 assumes UTF-8 source by default, so the comment is optional there.

# -*- coding: utf-8 -*-
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

assert not isEnglish('slabiky, ale liší se podle významu')
assert isEnglish('English')
assert not isEnglish('ގެ ފުރަތަމަ ދެ އަކުރު ކަ')
assert not isEnglish('how about this one : 通 asfަ')
assert isEnglish('?fd4))45s&')

edited Feb 16 '19 at 19:28 by Ivan
answered Nov 23 '14 at 01:45 by Salvador Dali

Thanks for the answer. In Python 3 what you suggested did not work correctly, but I replaced `s.decode('ascii')` with `s.encode('ascii')` and `UnicodeDecodeError` with `UnicodeEnecodeError`, and then it worked. – TJ1 Nov 23 '14 at 06:34

I was indeed using Python 2 to test my code. Thanks for improving the solution for Python 3. – Salvador Dali Nov 23 '14 at 06:56

I edited this answer to work with both Python 2 and 3. – Jonas Adler Jul 31 '17 at 12:32

@TJ1, your approach is correct for Python 3, you just have a typo: it's `UnicodeEncodeError`. – tsveti_iko Nov 07 '17 at 10:36
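
The Python 3 variant discussed in these comments can be sketched as follows. This is a minimal illustration that assumes the input is a Python 3 str; the function name is_ascii_py3 is hypothetical and does not appear in the answer.

```python
def is_ascii_py3(s):
    """Return True if every character of the str `s` is ASCII (Python 3)."""
    try:
        # Encoding to ASCII raises UnicodeEncodeError at the first non-ASCII character
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii_py3('English'))       # True
print(is_ascii_py3('_1991_اف_جي2'))  # False
```

Encoding directly to ASCII skips the intermediate UTF-8 round trip used in the answer, at the cost of no longer working on Python 2 byte strings.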

score 45 · Answer 2 · answered Dec 18 '19 at 11:32

IMHO this is the simplest solution:

def isEnglish(s):
  return s.isascii()

print(isEnglish("Test"))
print(isEnglish("_1991_اف_جي2"))

Output:
True
False

answered Dec 18 '19 at 11:32 by Torello

`isascii` was introduced in Python 3.7, so using this function requires Python 3.7 or later. – Kaushal Sep 15 '20 at 14:06
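
For interpreters older than 3.7, one possible fallback is to compare code points directly. The sketch below is an assumption about how such a fallback might look, not part of the answer; the name is_ascii_compat is hypothetical.

```python
def is_ascii_compat(s):
    """Use str.isascii() where it exists (Python 3.7+), else compare code points."""
    if hasattr(s, 'isascii'):
        return s.isascii()
    # ASCII code points run from 0 through 127
    return all(ord(ch) < 128 for ch in s)


print(is_ascii_compat('Test'))          # True
print(is_ascii_compat('_1991_اف_جي2'))  # False
```

Like the exception-based version, all() stops at the first non-ASCII character, so neither branch scans more of the string than necessary.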

Katerina · Answer 3 · 2016-09-30T14:14:55.427

If you work with Python 2 byte strings (not unicode objects), you can strip the punctuation with translate() and then check the result with isalnum(), which avoids raising and catching exceptions:

import string

def isEnglish(s):
    return s.translate(None, string.punctuation).isalnum()


print isEnglish('slabiky, ale liší se podle významu')
print isEnglish('English')
print isEnglish('ގެ ފުރަތަމަ ދެ އަކުރު ކަ')
print isEnglish('how about this one : 通 asfަ')
print isEnglish('?fd4))45s&')
print isEnglish('Текст на русском')

> False
> True
> False
> False
> True
> False

You can also filter non-ASCII characters out of a string with this function:

ascii = set(string.printable)   

def remove_non_ascii(s):
    return filter(lambda x: x in ascii, s)


remove_non_ascii('slabiky, ale liší se podle významu')
> slabiky, ale li se podle vznamu

Hi, while this solution looks nice (I would like to avoid exceptions whenever it is possible), it does not recognize all english characters. Even `Space` is not recognized. — jottbe, Aug 09 '19 at 08:14
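
A Python 3 sketch of the same exception-free idea that also accepts whitespace is given below. It tests each character against string.printable rather than relying on isalnum(). The names PRINTABLE_ASCII and is_printable_ascii are illustrative and not from the answer, and treating "printable ASCII" as the definition of English text is an assumption of this sketch.

```python
import string

# Printable ASCII: digits, letters, punctuation, and whitespace (including the space)
PRINTABLE_ASCII = frozenset(string.printable)


def is_printable_ascii(s):
    """Return True if every character of `s` is printable ASCII."""
    return all(ch in PRINTABLE_ASCII for ch in s)


def remove_non_ascii(s):
    """Return `s` with every character outside printable ASCII dropped."""
    return ''.join(ch for ch in s if ch in PRINTABLE_ASCII)


print(is_printable_ascii('English text, with spaces!'))          # True
print(is_printable_ascii('slabiky, ale liší se podle významu'))  # False
print(remove_non_ascii('slabiky, ale liší se podle významu'))
```

In Python 3, filter() returns an iterator rather than a string, which is why this version joins the kept characters explicitly.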

roi3363 · Answer 4 · 2020-03-14T21:35:41.500

I believe this one has a minimal runtime, since all() stops as soon as it finds a character that is not a Latin letter. It also uses a generator expression, so no intermediate list is built in memory.

import string

def has_only_latin_letters(name):
    char_set = string.ascii_letters
    return all(x in char_set for x in name)

>>> has_only_latin_letters('_1991_اف_جي2')
False
>>> has_only_latin_letters('blabla')
True
>>> has_only_latin_letters('bla bla')  # the space is not in string.ascii_letters
False
>>> has_only_latin_letters('blä blä')
False
>>> has_only_latin_letters('저주중앙초등학교')
False
>>> has_only_latin_letters('also a string with numbers and punctuation 1, 2, 4')  # needs a wider set, e.g. string.printable
False

You can also use a different set of characters:

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'

>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

>>> string.digits
'0123456789'

>>> string.digits + string.ascii_lowercase
'0123456789abcdefghijklmnopqrstuvwxyz'

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

To add latin accented letters, you can refer to this post.
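
One possible way to accept accented Latin letters is to decompose each character with unicodedata.normalize('NFD', ...), drop the combining marks, and test what remains. This is a sketch of that idea, not necessarily what the referenced post describes. The names ALLOWED, strip_accents, and has_only_latin_letters_or_accents are hypothetical, and allowing the space is an illustrative choice.

```python
import string
import unicodedata

# Allowed base characters; the space is included here as an illustrative choice
ALLOWED = frozenset(string.ascii_letters + ' ')


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks such as accents."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_only_latin_letters_or_accents(text):
    """True if, after removing accents, only characters in ALLOWED remain."""
    return all(ch in ALLOWED for ch in strip_accents(text))


print(has_only_latin_letters_or_accents('blä blä'))    # True
print(has_only_latin_letters_or_accents('naïve'))      # True
print(has_only_latin_letters_or_accents('저주중앙초등학교'))  # False
```

Letters that have no canonical decomposition, such as 'ø' or 'ß', are still rejected by this sketch and would need to be added to the allowed set by hand.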

score 6 · Answer 5 · edited May 16 '18 at 07:58

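import re

w = '_1991_اف_جي2'  # example string from the question

# match() anchors at the start, so this tests only the first character
english_check = re.compile(r'[a-z]')

if english_check.match(w):
    print("english", w)
else:
    print("other:", w)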

edited May 16 '18 at 07:58 by Roman-Stop RU aggression in UA
answered May 16 '18 at 07:08 by PemaGrg

What about words like `naïve` or `cliché`? – Maximilian Peters May 16 '18 at 07:57
Contrary to the accepted answer this also works for strings with accents :-) (I tested with ['tele', 'tèle', 'τήλε'] and the results are [True, True, False].) – Frank Mar 29 '23 at 20:41
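
To test every character rather than only the start of the string, a regex can search for anything outside the ASCII range. The sketch below is one way to do that; the names NON_ASCII and has_non_ascii are hypothetical, and equating "non-English" with "non-ASCII" is an assumption that deliberately flags accented words as well.

```python
import re

# Matches any single character outside the ASCII range (code points above 127)
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def has_non_ascii(s):
    """Return True if `s` contains at least one non-ASCII character."""
    return NON_ASCII.search(s) is not None


print(has_non_ascii('English'))       # False
print(has_non_ascii('_1991_اف_جي2'))  # True
print(has_non_ascii('cliché'))        # True
```

search() returns at the first offending character, so the whole string is scanned only when it is entirely ASCII.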

score 0 · Answer 6 · answered Jul 16 '19 at 12:33

w.isidentifier()

The docs describe this method as follows:

Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.
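
Note that Python 3 identifiers may contain non-ASCII letters, so isidentifier() on its own does not separate English from non-English strings. The short check below uses the example string from the question. The helper name is_ascii_identifier is hypothetical, and combining the method with isascii() (Python 3.7+) is a suggestion of this sketch, not part of the answer.

```python
def is_ascii_identifier(s):
    """True only if `s` is a valid identifier made purely of ASCII characters."""
    return s.isascii() and s.isidentifier()


w = '_1991_اف_جي2'

# Unicode letters are allowed in Python 3 identifiers, so this is a valid identifier
print(w.isidentifier())                   # True
print(is_ascii_identifier(w))             # False
print(is_ascii_identifier('plain_name'))  # True
```

Even the combined check rejects ordinary English text with spaces or punctuation, so it only suits inputs that are expected to look like identifiers.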

answered Jul 16 '19 at 12:33 by Furkan
