I have a text file that contains Urdu words, and I need to remove the duplicates. To do that, the words first have to be sorted. In English this is not a problem, but when I do the same for Urdu I get errors. As a test case, suppose my text file contains the following words (one word per line for simplicity):
جنگ
بندی
اس
کروانا
سات
اس
سات
The command I run and the error it produces are as follows:
[example@localhost compare]$ ./get_urdu_words.py |sort
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Here get_urdu_words.py is a script that extracts Urdu words from a mixed Urdu/English file, and sort is the standard Unix sort command run from bash.
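The error message can be reproduced in isolation, which helps show where it comes from. The snippet below is a minimal sketch written for Python 3 (an assumption; the script in this question uses Python 2 syntax), and the names `word`, `failure` and `utf8_bytes` are illustrative only. It encodes the first sample word with the ASCII codec, which is what Python 2 appears to fall back to when its output is piped rather than sent to a terminal.

```python
# Minimal sketch (Python 3): reproduce the ASCII encoding failure on one word.
word = "جنگ"  # first word of the sample input, three characters long

failure = None
try:
    # None of the three characters exists in ASCII, so this raises.
    word.encode("ascii")
except UnicodeEncodeError as exc:
    failure = exc
    print(exc)  # the message itself is plain ASCII

# Encoding the same word as UTF-8 succeeds; each character takes two bytes.
utf8_bytes = word.encode("utf-8")
```

The reported range of positions covers all three characters of the word, which matches "position 0-2" in the error above.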
This is the code of get_urdu_words.py
import re
test = u"جنگ بندی اس کروانا سات اس سات"
token = test.split(' ')
for w in token:
    status = re.search(ur'[\u0600-\u06ff]+', w)
    if status:
        print w
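For reference, the end result being aimed at (extract the Urdu words, drop duplicates, sort) can be sketched entirely in Python. This is only a sketch under stated assumptions: it targets Python 3 rather than the Python 2 used above, the helper names `unique_urdu_words` and `to_utf8_lines` are hypothetical, and the sort is by Unicode code point, which may not match proper Urdu dictionary order.

```python
import re

# Same character range as the script above: the Arabic Unicode block.
URDU_RUN = re.compile(r'[\u0600-\u06ff]+')


def unique_urdu_words(text):
    """Return the distinct Urdu tokens in text, sorted by code point."""
    # A set removes duplicates, so no prior sorting is needed for that step.
    words = {w for w in text.split() if URDU_RUN.search(w)}
    return sorted(words)


def to_utf8_lines(words):
    """Join words one per line and encode them explicitly as UTF-8 bytes."""
    return "".join(w + "\n" for w in words).encode("utf-8")


test = "جنگ بندی اس کروانا سات اس سات"
result = unique_urdu_words(test)
payload = to_utf8_lines(result)
```

Writing `payload` to `sys.stdout.buffer` sends UTF-8 bytes regardless of which encoding Python chooses for a piped stdout, so the output can still be passed to other commands.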
This question is specific to the Urdu language only.