
I am working with text data that mixes several languages, and I need to test whether a token/string is alphabetical, meaning it is potentially a word. Is there a built-in function like 'somestring'.isalpha() that tests whether a string is alphabetical in other languages (Portuguese and Spanish)? I tried 'ó'.isalpha(), which returns False.

My current idea is to look up the Unicode table, find the first and last letter of each alphabet, and test whether a character falls within that range.
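A rough sketch of why hand-picked ranges can be fragile, assuming Python 3: the Latin-1 block that holds 'ó' and 'ç' also contains the symbols '×' and '÷', so a plain range test would accept them. The standard `unicodedata.category` lookup avoids maintaining ranges by hand. The helper names `is_letter` and `in_latin1_range` are made up for this illustration.

```python
import unicodedata

def is_letter(ch):
    """True if the single character ch is in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def in_latin1_range(ch):
    """Naive range test over U+00C0..U+00FF, shown only for comparison."""
    return "\u00c0" <= ch <= "\u00ff"

# '×' (U+00D7) and '÷' (U+00F7) fall inside the range but are math symbols, not letters.
for ch in ["\u00f3", "\u00e7", "\u00d7", "\u00f7"]:
    print(repr(ch), in_latin1_range(ch), is_letter(ch))
```

The category lookup covers every script in the Unicode database, so the same check should work for Portuguese, Spanish, and other languages without extra tables.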

Bin

2 Answers


Will this solve your problem?

>>> u'é'.isalpha()
True

Just as an FYI, the below example works perfectly in Python 3:

words = ['você', 'quer', 'uma', 'maçã']
for word in words:
    word.isalpha()

In Python 2, assuming the words are UTF-8 encoded byte strings, you could do something like:

for word in words:
    unicode(word, "utf-8").isalpha()
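A possible explanation for the False result in the question, offered as a guess rather than a diagnosis: in Python 2 an unprefixed literal like 'ó' is a byte string, and the byte-level check does not see it as one letter. The sketch below reproduces the same distinction in Python 3 by encoding explicitly; the variable names are only for illustration.

```python
# 'ó' encoded as UTF-8 is two bytes; decoded, it is a single code point.
raw = "\u00f3".encode("utf-8")   # b'\xc3\xb3'
text = raw.decode("utf-8")       # 'ó'

# bytes.isalpha() only recognizes ASCII letters.
print(raw.isalpha())
# str.isalpha() uses Unicode character properties.
print(text.isalpha())
```

So decoding to a text string first, as the Python 2 snippet above does with unicode(), is what makes isalpha() aware of accented letters.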
jgritty
  • `word.isalpha()` by itself won't show the user any output, though.. maybe `print(word.isalpha())`? – DSM Dec 14 '15 at 19:31
  • It will in the terminal, but yes, maybe. – jgritty Dec 14 '15 at 19:31
  • Your Python 2 example doesn't work for me. It prints and encodes "Кири́лл" correctly, and `u"Кири́лл".isalpha()` returns `True`, but it won't work with `unicode(code, "utf-8").isalpha()`. – SuperBiasedMan Jan 20 '16 at 15:58
  • For some reason, both Python 2 and Python 3 think "и́" is not alpha. – jgritty Jan 20 '16 at 18:27
  • http://stackoverflow.com/questions/21920882/python-isalpha-doesnt-handle-unicode-combing-marks-properly – jgritty Jan 20 '16 at 18:28
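One way to work around the combining-mark case raised in the comments, sketched for Python 3: "и́" can be stored as the letter "и" followed by U+0301 (combining acute accent), and isalpha() rejects the mark because it is not in a letter category. The helper `is_alpha_with_marks` is a hypothetical name, not a library function. It normalizes to NFC and then accepts letters plus combining marks, as long as at least one letter is present.

```python
import unicodedata

def is_alpha_with_marks(token):
    """Hypothetical helper: True if token consists of letters and combining marks only."""
    if not token:
        return False
    # NFC merges base letter + mark into one code point where a precomposed form exists.
    composed = unicodedata.normalize("NFC", token)
    categories = [unicodedata.category(ch) for ch in composed]
    has_letter = any(c.startswith("L") for c in categories)
    only_letters_and_marks = all(c[0] in "LM" for c in categories)
    return has_letter and only_letters_and_marks

# "Кири́лл" written with an explicit U+0301 after the second "и".
name = "\u041a\u0438\u0440\u0438\u0301\u043b\u043b"
print(name.isalpha())
print(is_alpha_with_marks(name))
```

This is deliberately permissive: it does not check that each mark follows a base letter, so treat it as a starting point rather than a full grapheme-aware test.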

This library is not part of NLTK either, but it certainly helps.

1) Install the langdetect library: $ pip install langdetect

Supported Python versions: 2.6, 2.7, 3.x.

2) Test your code

>>> from langdetect import detect

>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'

Reference link:

https://pypi.python.org/pypi/langdetect?

SVK