I have a Persian text file that has some lines like this:
ذوب 6 خوی 7 بزاق ،آبدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف
I want to generate a list of words from this line. For me, the word boundaries are the numbers (such as 6 and 7 in the line above) and the ، character.
So the list should be:
[ 'ذوب','خوی','بزاق','آبدهان','یم','زهاب','آبرو','حیثیت' ,'شرف']
I want to do this in Python 3.3. What is the best way to do it? I would really appreciate any help.
EDIT:
I got a number of answers, but they didn't work when I tried them on another test case. The test case is this:
منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن
and I expect to get a list of tokens like this:
['منهدم کردن','خراب کردن', 'ویران کردن', 'تخریب کردن','نابود کردن', 'از بین بردن']
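For reference, here is a minimal sketch of the splitting rule described above, not a definitive solution. It assumes the delimiters are runs of digits, the Arabic comma ، (U+060C), and the colon; the colon is an assumption drawn from the second test case, where it does not appear in the expected output. Spaces are deliberately not treated as delimiters, so multi-word tokens such as 'خراب کردن' stay together. The names `DELIMITERS` and `tokenize` are made up for illustration.

```python
import re

# Assumed delimiters: any run of digits, the Arabic comma (U+060C), or a colon.
# In Python 3, \d on str patterns also matches Persian/Arabic-Indic digits.
DELIMITERS = re.compile(r"[\d\u060C:]+")


def tokenize(line):
    """Split a line on the assumed delimiters and drop empty pieces."""
    parts = DELIMITERS.split(line)
    # Strip surrounding whitespace; pieces that were only whitespace are discarded.
    return [p.strip() for p in parts if p.strip()]


line1 = "ذوب 6 خوی 7 بزاق ،آبدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف"
line2 = "منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن"

print(tokenize(line1))
print(tokenize(line2))
```

Because whitespace is only stripped from the ends of each piece, a phrase with internal spaces is returned as a single token, which is what the second test case needs.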