I am collecting strings that may contain writing in other languages, and I want to find all strings that contain non-English characters.
For example:
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']
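It depends on what you mean by "non-English" characters. If you only allow the letters a-z, the string method isalpha() is a starting point, but note that it returns True for letters in any script ('行中ワ'.isalpha() is True), so on its own it does not restrict a string to a-z.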
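lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']
# Keep only strings made up entirely of letters (of any script).
# For this list the result is [], since every string contains a digit, space, or '!'.
allowed_strings = [string for string in lst if string.isalpha()]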
For example, if you use isascii() instead of isalpha() in the list comprehension above, you would remove the last string but keep the first two.
If you also want to allow special characters, you cannot use isalpha() alone, but perhaps it's a start (it won't accept "hi!" or "hi here").
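As a rough sketch of the isascii() variant on the list from the question (the variable names are just for illustration, and str.isascii() needs Python 3.7 or newer):

```python
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']

# isascii() is True only when every character is in the 7-bit ASCII range,
# so digits, spaces, and punctuation such as '!' are all allowed.
ascii_only = [s for s in lst if s.isascii()]

# The strings the question is after: anything with at least one non-ASCII character.
non_ascii = [s for s in lst if not s.isascii()]

print(ascii_only)
print(non_ascii)
```

Note that this treats accented Latin letters such as é as "non-English" too, which may or may not be what you want.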
First you need to decide what "English character" means. Do you want to reject words like café or naïve?
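If you want only A-Z, or A-Z and digits, you can use str.isalpha() or str.isalnum() combined with str.isascii(); on their own, isalpha() and isalnum() also accept letters and digits from other scripts. str.isascii() won't work in your case if you want to keep accented characters, as the 7-bit US-ASCII range doesn't include any of them, just some extra symbols.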
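A minimal sketch of that strict check, assuming you really do want to reject accented letters (the helper name is_ascii_alnum is made up for this example):

```python
def is_ascii_alnum(text):
    """Return True if text is non-empty and contains only A-Z, a-z, and 0-9."""
    # isalnum() alone would also accept 'café' or '行中ワ';
    # isascii() narrows it down to the unaccented ASCII range.
    return text.isascii() and text.isalnum()


samples = ['Engl1sh', 'cafe', 'café', 'naïve', 'hi!', '行中ワ']
results = {s: is_ascii_alnum(s) for s in samples}
print(results)
```

Only 'Engl1sh' and 'cafe' pass here; 'café' and 'naïve' are rejected because of the accented letters, and 'hi!' because of the punctuation.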
To include accented characters you can use a regular expression with the third-party regex package (not the built-in re module) and match against specific Unicode scripts or character blocks. For example, \p{IsLatin} will match all characters in the Latin script.
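To find strings with non-English characters you can search for [^\p{IsLatin}\p{IsCommon}]. Spaces, digits, and punctuation belong to the Common script rather than Latin, so they need to be excluded as well. Use regex.search() rather than regex.match(), since match() only tests the start of the string:
import regex
regex.search(r'[^\p{IsLatin}\p{IsCommon}]', 'not english 行中ワ')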
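If installing the regex package is not an option, a rough approximation using only the standard library is to look at Unicode character names. This is a different technique from the script property above and only a sketch: it assumes that Latin letters have names starting with "LATIN", which misses some edge cases such as fullwidth forms. The helper name has_non_latin_letter is made up for this example.

```python
import unicodedata


def has_non_latin_letter(text):
    """Return True if any alphabetic character in text is not a Latin letter."""
    for ch in text:
        # Skip digits, spaces, and punctuation; only letters are inspected.
        if ch.isalpha() and not unicodedata.name(ch, '').startswith('LATIN'):
            return True
    return False


lst = ['english1234!', 'Engl1sh', 'not english 行中ワ', 'café', 'naïve']
flagged = [s for s in lst if has_non_latin_letter(s)]
print(flagged)
```

With this check, 'café' and 'naïve' are kept as Latin text, and only the string containing 行中ワ is flagged.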