Tokenizing Non-English Text in Python

Question

I have a Persian text file that has some lines like this:

 ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف

I want to generate a list of words from this line. For me, the word boundaries are numbers (such as 6 and 7 in the line above) and the ، character, so the list should be:

[ 'ذوب','خوی','بزاق','آب‌دهان','یم','زهاب','آبرو','حیثیت' ,'شرف']

I want to do this in Python 3.3. What is the best way to do it? I would really appreciate any help.

EDIT:

I got a number of answers, but they did not work when I tried them on another test case. The test case is this:

منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن

and I expect to get this list of tokens:

['منهدم کردن','خراب کردن', 'ویران کردن', 'تخریب کردن','نابود کردن', 'از بین بردن']

falsetru · Accepted Answer · 2014-01-09T15:36:02.080

3

Using the third-party regex package (not the built-in re module):

>>> import regex
>>> text = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'
>>> regex.findall(r'\p{L}+', text.replace('\u200c', ''))
['ذوب', 'خوی', 'بزاق', 'آبدهان', 'یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']

The text contains a ZERO WIDTH NON-JOINER (U+200C); the code above removes it with str.replace.
\p{L} (or \p{Letter}) matches any kind of letter from any language.

See Regex Tutorial - Unicode Characters and Properties.
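
To check which Unicode general category a character belongs to (and so whether a property such as \p{L} would match it), the standard-library unicodedata module can be queried. This is only an illustrative sketch; the helper name describe is made up for this example and is not part of either library.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, Unicode name) for one character.

    `describe` is a hypothetical helper used for illustration only.
    """
    code_point = 'U+%04X' % ord(ch)
    return code_point, unicodedata.category(ch), unicodedata.name(ch, '<unnamed>')

# A Persian letter, the zero width non-joiner, the Arabic comma, a digit, a space.
for ch in ['ذ', '\u200c', '،', '6', ' ']:
    print(describe(ch))
```

Categories that start with L are the ones \p{L} covers; U+200C reports Cf, which is why \p{L}+ alone does not match it.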

UPDATE

To keep U+200C in the output, use [\p{Cf}\p{L}]+ instead (\p{Cf} or \p{Format} matches invisible formatting characters):

>>> regex.findall(r'[\p{Cf}\p{L}]+', text)
['ذوب', 'خوی', 'بزاق', 'آب\u200cدهان', 'یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']

It looks different from what you want, but they are equal:

>>> got = regex.findall(r'[\p{Cf}\p{L}]+', text)
>>> want = [ 'ذوب','خوی','بزاق','آب‌دهان','یم','زهاب','آبرو','حیثیت' ,'شرف']
>>> print(want)
['ذوب', 'خوی', 'بزاق', 'آب\u200cدهان', 'یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']
>>> got == want
True
>>> got[:3]
['ذوب', 'خوی', 'بزاق']
>>> got[4:]
['یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']

UPDATE2

Some words in the edited question contain a space.

>>> ' ' in 'منهدم کردن'
True

I added \s to the character class in the following code so that it also matches spaces, then stripped leading and trailing spaces from the matched strings and filtered out empty strings.

>>> text = 'منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن'
>>> want = ['منهدم کردن','خراب کردن', 'ویران کردن', 'تخریب کردن','نابود کردن', 'از بین بردن']
>>> [x for x in map(str.strip, regex.findall(r'[\p{Cf}\p{L}\s]+', text)) if x] == want
True

edited Jan 09 '14 at 15:36

answered Jan 09 '14 at 04:53

falsetru


Thanks for the answer, again your answer does not exactly generate what it should be. as for example `آب‌دهان` should be one term, but your code generated `آب` and `دهان` as two separate terms. – TJ1 Jan 09 '14 at 04:56
U+200c is contained both in the input string and output array. I believe it should be there, although perhaps the terminal should not be printing it. – Peter Gibson Jan 09 '14 at 05:19
@falsetru is your answer for Python 3.3 or for Python 2? I ran it in Python 3 and it returned an empty list. – TJ1 Jan 09 '14 at 05:25
@TJ1, This is for Python 3.3. Did you use the `regex` package I linked in the answer (not the built-in `re`)? – falsetru Jan 09 '14 at 05:28
@TJ1, See [the screencast](http://asciinema.org/a/7111) I just recorded. I tested this in Windows 7 and in Ubuntu 13.10. – falsetru Jan 09 '14 at 05:31
Yes after installing `regex` now it works. This is exactly what I wanted, thanks for the answer. – TJ1 Jan 09 '14 at 05:35
@falsetru However, after careful looking what you have generated is `آبدهان`, and in original line it is `آب‌دهان`, so they are not exactly the same! – TJ1 Jan 09 '14 at 05:42
@TJ1, just don't strip out the u200c - it's supposed to be there and will not be printed when the actual string is printed (as opposed to its representation) – Peter Gibson Jan 09 '14 at 06:32
@falsetru thanks for the help, but for another test case your answer does not work. I updated the question. – TJ1 Jan 09 '14 at 15:17
@PeterGibson thanks for the help, but for another test case your answer does not work. I updated the question. – TJ1 Jan 09 '14 at 15:18
@TJ1, In the expected list, the first item string contains a space. Is that right? – falsetru Jan 09 '14 at 15:30
@TJ1, Try `[x for x in map(str.strip, regex.findall(r'[\p{Cf}\p{L}\s]+', text)) if x]` – falsetru Jan 09 '14 at 15:33
@falsetru thank you very much, now this one works great. Can you please explain how it works? and also please update your answer so I can accept it. – TJ1 Jan 09 '14 at 15:37
@TJ1, I updated the answer with an explanation. I hope it is clear. – falsetru Jan 09 '14 at 15:39

score 1 · Answer 2 · edited May 23 '17 at 12:13

Use re.split to split on whitespace (\s), digits (\d) and the ، character.

# python 3
import re
INPUT = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'
EXPECTED = [ 'ذوب','خوی','بزاق','آب‌دهان','یم','زهاب','آبرو','حیثیت' ,'شرف'] 

OUTPUT = re.split(r'[\s\d،]+', INPUT)  # raw string so \s and \d reach the regex engine unchanged
assert OUTPUT == EXPECTED
print('\n'.join(OUTPUT))

Note that the \u200c you see in the output list is a non-printing character that is actually contained in the original string. Python escapes it because it is showing the representation (repr) of the list and the strings it contains, not printing the strings for display. Here's the difference:

INPUT = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'
print(INPUT)
ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف

print(repr(INPUT)) # notice the \u200c below
'ذوب 6 خوی 7 بزاق ،آب\u200cدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'

print(['in', 'an', 'array', INPUT]) # the \u200c is also shown when printing an array
['in', 'an', 'array', 'ذوب 6 خوی 7 بزاق ،آب\u200cدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف']

This is similar to how python handles newline characters:

>>> 'new\nline'
'new\nline'
>>> print('new\nline')
new
line
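
The same distinction can be checked directly for U+200C. A minimal sketch, using an ASCII stand-in word so the invisible character is easy to locate; it relies only on built-in str methods.

```python
# 'ab' + ZERO WIDTH NON-JOINER + 'cd': an ASCII stand-in for a word like the one in the question.
word = 'ab\u200ccd'

# The character is part of the string, so it counts toward the length.
print(len(word))               # 5

# repr() escapes characters that str.isprintable() reports as non-printable.
print('\u200c'.isprintable())  # False
print(repr(word))              # 'ab\u200ccd'

# print() writes the raw character; most terminals render it as nothing visible.
print(word)
```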

Edit:

Here is the regex for your updated sample that uses falsetru's findall strategy, but uses the built-in re module:

OUTPUT = [s.strip() for s in re.findall(r'(?:[^\W\d_]|[\s])+', INPUT) if s.strip()]

The pattern (?:[^\W\d_]|[\s])+ looks a little strange because Python's re module has no equivalent of regex's \p{L} (Letter). Instead, it uses the solution proposed at https://stackoverflow.com/a/8923988/66349

[^\W\d_] - (not ((not alphanumeric) or digits or underscore))

So in summary, match one or more characters (+) that are either (|) Unicode letters [^\W\d_] or whitespace \s.

falsetru's method is probably more readable, but requires the 3rd party library.
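
One caveat worth verifying before relying on the findall pattern with the built-in module: in re, \w covers letters, digits and the underscore, while U+200C is a format character (category Cf), so [^\W\d_] should stop at it and split a word containing it in two. A possible alternative, sketched below, is to split on the delimiters rather than match the words. It is not either answerer's method; the function name tokenize and the choice of delimiters (digits, the ، character, and the colon seen in the second test case) are assumptions made for this example.

```python
import re

# Assumed delimiters: runs of digits, the Arabic comma (U+060C), and ':'.
DELIMITERS = re.compile(r'[\d\u060c:]+')

def tokenize(text):
    """Split text on the assumed delimiters and drop empty pieces.

    Spaces and U+200C inside a token are kept; only surrounding
    whitespace is stripped.
    """
    pieces = (piece.strip() for piece in DELIMITERS.split(text))
    return [piece for piece in pieces if piece]

text = 'منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن'
print(tokenize(text))
```

Because nothing inside a token is matched explicitly, multi-word entries and words containing U+200C pass through unchanged. The trade-off is that any separator not listed in DELIMITERS stays in the output.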

Here is what I get with this: `['ذوب', 'خوی', 'بزاق', 'آب\u200cدهان', 'یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']` — TJ1, Jan 09 '14 at 05:52
@TJ1 python is showing a representation of the non-printing character in the string - see my updated answer. Try printing that string for yourself (not the array) — Peter Gibson, Jan 09 '14 at 06:07
Peter:when I tried what you suggested for another example it did not work at all. The example is here: INPUT = `'منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن'`. I expect to get tokens as for example `1 خراب کردن`, but I get `کردن` and ` خراب` as two separate tokens. — TJ1, Jan 09 '14 at 15:03
