find non English characters in python string

Question

I am collecting strings that may contain writing in other languages, and I want to find all strings that contain non-English characters.

For example:

lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']

You may need to better define "English characters". Is "café" english? — wim, Feb 10 '21 at 07:32
Would this article help you: https://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii — ex4, Feb 10 '21 at 07:32
@GinoMempin not a good duplicate. The accepted answer assumes Python 2 is used (Python3 strings are Unicode and `encode()` uses UTF8 by default). The best answer is to use `isascii` but even that would fail with many English words that use characters outside the 7-bit US-ASCII range. — Panagiotis Kanavos, Feb 10 '21 at 09:16
You have to decide what `English characters` means, and how many false negatives you can live with. `isascii()` is perhaps the easiest, but will choke on many English words. Beyond that, you can use [a regex](https://stackoverflow.com/a/4316097/134204) matching specific character classes or groups, and accept eg words that only have Latin characters with eg `[^\p{IsLatin}]`. — Panagiotis Kanavos, Feb 10 '21 at 09:24
@PanagiotisKanavos Aside from the accepted answer on that duplicate, there are 7 *other* answers, some of which are relatively new like [this one](https://stackoverflow.com/a/59391135/2745495) that assumes Python 3 (>=3.7) and uses `isascii`. The presence of an accepted answer there doesn't block anyone from posting newer/better solutions, especially to address the "What an English character means" question. — Gino Mempin, Feb 10 '21 at 10:21

tbjorch · Answer 1 · 2021-02-10T09:05:05.833

1

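It depends on what you mean by "non-English" characters. If you only allow letters, you could use the string method "isalpha". Note that in Python 3 "isalpha" is Unicode-aware, so it also returns True for letters from other scripts (for example '行中ワ'); combine it with "isascii" if you need strictly a-z.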

lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']
allowed_strings = [string for string in lst if string.isalpha()]

If alphanumeric characters are allowed, use string.isalnum() (it is Unicode-aware as well).
If alphanumeric characters plus standard special characters are allowed, you could use string.isascii() (Python 3.7+).
For any other specific scenario, use a regex.

E.g. in your example, using isascii() in the list comprehension above would remove the last string but keep the first 2.
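To make the difference between these string methods concrete, here is a minimal sketch run against the list from the question. The variable names (`ascii_only`, `non_ascii`, `plain_letters`) are just illustrative, and `isascii()` assumes Python 3.7 or newer.

```python
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']

# isalpha() and isalnum() are Unicode-aware: letters from any script count.
print('行中ワ'.isalpha())   # True
print('Engl1sh'.isalnum())  # True

# isascii() is True only if every character is in the 7-bit ASCII range.
ascii_only = [s for s in lst if s.isascii()]
non_ascii = [s for s in lst if not s.isascii()]
print(ascii_only)  # ['english1234!', 'Engl1sh']
print(non_ascii)   # ['not english 行中ワ']

# Restricting to unaccented a-z / A-Z needs both checks together.
plain_letters = [s for s in lst if s.isascii() and s.isalpha()]
print(plain_letters)  # []
```

Keep in mind that `isascii()` would also flag 'café' or 'Brontë' as non-ASCII, so it only fits if accented letters should count as "non-English".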

edited Feb 10 '21 at 09:05

answered Feb 10 '21 at 07:34

tbjorch


Are you sure about `isascii`? The 7-bit US-ASCII range doesn't have any letters with diacritics so even `café` and many English names can't be represented by it, eg `Brontë` – Panagiotis Kanavos Feb 10 '21 at 09:11
Hmm yes that is true, in that case isascii will not work either. If none of the available str methods provides what's necessary i would go for regex. – tbjorch Feb 10 '21 at 09:35

score 0 · Answer 2 · answered Feb 10 '21 at 07:39


If you also want to allow special characters, you cannot use isalpha() alone, but perhaps that's a start (it won't accept "hi!" or "hi here").
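One possible way to go beyond isalpha() is an explicit allow-list of characters. The sketch below is only an illustration: the `ALLOWED` set and the `only_allowed` helper are made-up names, and which characters belong in the set is an assumption you would adjust to your own definition of "English".

```python
import string

# Hypothetical allow-list: ASCII letters, digits, punctuation and the space character.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + ' ')

def only_allowed(text):
    """Return True if every character of text is in ALLOWED."""
    return all(ch in ALLOWED for ch in text)

print(only_allowed('hi!'))      # True
print(only_allowed('hi here'))  # True
print(only_allowed('café'))     # False
```

Adding accented letters such as 'é' to the set would let 'café' through, at the cost of maintaining the list by hand.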


Zartant


True. isascii allows alphanumerics plus standard special characters, with some deviations. That might be a better option, so I appended it to my answer above. – tbjorch Feb 10 '21 at 07:43

score 0 · Answer 3 · answered Feb 10 '21 at 09:36

First you need to decide what English character means. Do you want to reject words like café or naïve?

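If you want only A-Z, or A-Z and numbers, you can use str.isalpha() or str.isalnum() together with str.isascii(); on their own, those two methods also accept letters and digits from other scripts. You can't use str.isascii() by itself if words like café should pass, as the 7-bit US-ASCII range doesn't include any accented characters, just some extra symbols.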

To include accented characters you can use a regular expression with the third-party regex package and match against specific Unicode scripts or character blocks. For example, \p{IsLatin} will match all characters in the Latin script.

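To find strings with non-English characters you can search for [^\p{IsLatin}]. Keep in mind that spaces, digits and punctuation are not in the Latin script either (Unicode assigns them to the Common script), so this pattern matches them too:

import regex  # third-party package (pip install regex)
regex.search(r'[^\p{IsLatin}]', 'not english 行中ワ')  # search() scans the whole string; match() only tests the start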
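If installing the regex package is not an option, a rough standard-library approximation of the same idea is to look at each letter's Unicode name. This is a sketch rather than an equivalent of \p{IsLatin}: the helper name `has_non_latin_letter` is made up here, and the check relies on the assumption that Latin-script letters have names starting with 'LATIN' (so, for instance, fullwidth Latin letters would be flagged). Non-letters such as digits, spaces and punctuation are ignored.

```python
import unicodedata

def has_non_latin_letter(text):
    """Return True if text contains a letter whose Unicode name does not start with 'LATIN'."""
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, '').startswith('LATIN'):
            return True
    return False

lst = ['english1234!', 'Engl1sh', 'not english 行中ワ', 'café', 'Brontë']
flagged = [s for s in lst if has_non_latin_letter(s)]
print(flagged)  # ['not english 行中ワ']
```

With this definition 'café' and 'Brontë' are accepted, while strings containing CJK, Katakana, Greek or Cyrillic letters are flagged.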
