Check whether a string is alphabetical for languages other than English

Question

I am working with text data that mixes several languages, and I am trying to test whether a token/string is alphabetical, meaning it is potentially a word. Is there a built-in function like 'somestring'.isalpha() that tests whether a string is alphabetical in other languages (Portuguese and Spanish)? I tried 'ó'.isalpha(), which returns False.

My current idea is to look up the Unicode table, find the first and last letter of each alphabet, and test whether each character falls within that range.

Aside: if you're working with unicode data, you should really be working in Python 3. It's much saner. — DSM, Dec 14 '15 at 19:30
`'ó'.decode("utf-8").isalpha()`, that will fail too though for certain input — Padraic Cunningham, Dec 14 '15 at 19:32

jgritty · Accepted Answer · 2015-12-14T19:28:33.397

2

Will this solve your problem?

>>> u'é'.isalpha()
True

Just as an FYI, the example below works in Python 3 without any extra decoding:

words = ['você', 'quer', 'uma', 'maçã']
for word in words:
    word.isalpha()

In Python 2, where string literals are byte strings, you could decode them first with something like:

for word in words:
    unicode(word, "utf-8").isalpha()
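
A minimal Python 3 sketch of the same idea, added for illustration. The extra sample words ('niño', 'abc123') and the variable names are assumptions, not part of the original answer. It also shows why un-decoded bytes report False, which is likely what happened with 'ó' in the question:

```python
# Python 3: str is Unicode text, so isalpha() recognises accented letters.
words = ['você', 'quer', 'uma', 'maçã', 'niño', 'abc123']  # sample tokens (assumed)

results = {word: word.isalpha() for word in words}
for word, is_alpha in results.items():
    print(word, is_alpha)

# Raw bytes (e.g. read from a file opened in binary mode) must be decoded first.
raw = 'maçã'.encode('utf-8')
decoded_is_alpha = raw.decode('utf-8').isalpha()  # decoded text: letters are recognised
raw_is_alpha = raw.isalpha()  # bytes.isalpha() only accepts ASCII letters
print(decoded_is_alpha, raw_is_alpha)
```

Decoding once at the input boundary and working with text everywhere else avoids having to call a decode step before every check.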

edited Dec 14 '15 at 19:28

answered Dec 14 '15 at 19:20


`word.isalpha()` by itself won't show the user any output, though.. maybe `print(word.isalpha())`? – DSM Dec 14 '15 at 19:31
It will in the terminal, but yes, maybe. – jgritty Dec 14 '15 at 19:31
Your Python 2 example doesn't work for me. It prints and encodes "Кири́лл" correctly, and `u"Кири́лл".isalpha()` returns `True`, but it won't work with `unicode(code, "utf-8").isalpha()`. – SuperBiasedMan Jan 20 '16 at 15:58
For some reason, both python2 and python3 think "и́" is not alpha. – jgritty Jan 20 '16 at 18:27
http://stackoverflow.com/questions/21920882/python-isalpha-doesnt-handle-unicode-combing-marks-properly – jgritty Jan 20 '16 at 18:28
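
The behaviour discussed in these comments can be worked around with the standard `unicodedata` module. "и́" is the letter "и" followed by a combining acute accent (U+0301), which Unicode classifies as a mark rather than a letter, so `isalpha()` rejects it. The sketch below is one possible approach, not something proposed in the thread; the helper name `is_alpha_with_marks` is hypothetical. It normalises to NFC and then accepts letters plus combining marks:

```python
import unicodedata


def is_alpha_with_marks(text):
    """Return True if text has at least one letter and consists only of
    letters (category L*) and combining marks (category M*)."""
    # NFC merges base letter + accent into one code point where one exists.
    normalized = unicodedata.normalize('NFC', text)
    has_letter = False
    for ch in normalized:
        category = unicodedata.category(ch)
        if category.startswith('L'):
            has_letter = True
        elif category.startswith('M'):
            continue  # combining mark: tolerated, but does not count as a letter
        else:
            return False
    return has_letter


name = 'Кири\u0301лл'  # "и" followed by a combining acute accent
print(name.isalpha())             # plain isalpha() rejects the combining mark
print(is_alpha_with_marks(name))  # the helper accepts it
```

Note that this sketch does not check that each mark actually follows a base letter, so a string that begins with a stray combining mark would still pass.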

SVK · Answer 2 · 2016-08-03T20:09:32.753

0

The langdetect library is not part of NLTK, but it can help here by identifying which language a string is written in.

1) Install the langdetect library:

$ pip install langdetect

Supported Python versions 2.6, 2.7, 3.x.

2) Test your code

>>> from langdetect import detect

>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'

Reference link:

https://pypi.python.org/pypi/langdetect?

edited Aug 03 '16 at 20:09

answered Aug 03 '16 at 20:02

