How to sort Urdu words in Python or Bash

Question

I have a text file that contains Urdu words, and I need to remove the duplicates. To do that, the words have to be sorted first. With English text this is not a problem, but when I do the same with Urdu, I get errors. As a test case, suppose my text file contains the following words (one word per line for simplicity):

جنگ
بندی
 اس
کروانا
سات
 اس
سات

The command and the error are as follows.

[example@localhost compare]$ ./get_urdu_words.py |sort

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Here get_urdu_words.py is a script that extracts Urdu words from a mixed Urdu/English file, and sort is the shell command.

This is the code of get_urdu_words.py

import re
test=u"جنگ بندی  اس کروانا سات  اس سات"

token=test.split(' ')
for w in token:
 status=re.search(ur'[\u0600-\u06ff]+',w)
 if status:
  print w

This question is specific to the Urdu language only.

What code are you trying? Does the regular `sort` function not work on Urdu? — Paul Rooney, Nov 12 '15 at 04:04
When I use `sort` on what you provided I don't get any errors (of course I don't know if they are really sorted according to the local convention or not, but I suppose with the correct locale it should work). — 4ae1e1, Nov 12 '15 at 04:05
Re update: the error is in your `get_urdu_words.py`. Nothing to do with `sort` or bash. — 4ae1e1, Nov 12 '15 at 04:07
Actually, this is the exact same problem as https://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python. — 4ae1e1, Nov 12 '15 at 04:16

score 1 · Accepted Answer · answered Nov 12 '15 at 04:21

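A small modification should solve your problem. When stdout is piped, Python 2 cannot detect an output encoding and falls back to ASCII, so encode each word as UTF-8 before printing. Try this: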

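# -*- coding: utf-8 -*-
# (the line above is needed in Python 2 for non-ASCII literals in the source)
import re
test=u"جنگ بندی  اس کروانا سات  اس سات"

token=test.split(' ')
for w in token:
 # keep only tokens that contain characters from the Arabic block (used by Urdu)
 status=re.search(ur'[\u0600-\u06ff]+',w)
 if status:
  # the fix: emit UTF-8 bytes instead of a unicode object
  print w.encode('utf-8')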

After this change, run the command again:

[example@localhost compare]$ ./get_urdu_words.py |sort
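
For anyone on Python 3, where the `ur''` prefix and the `print` statement no longer exist, here is a rough sketch of the same idea that also does the deduplication and sorting inside the script. The function names `unique_sorted_urdu_words` and `write_words` are made up for illustration and are not part of the original script. It sorts by Unicode code point, which is enough for removing duplicates but is probably not true Urdu dictionary order (for example, ک has a higher code point than ل even though it comes earlier in the alphabet).

```python
import re
import sys

# Same character range as the original script: the Arabic Unicode block,
# which contains the letters used by Urdu.
URDU_WORD = re.compile(r'[\u0600-\u06ff]+')


def unique_sorted_urdu_words(text):
    """Return the distinct tokens containing Urdu characters, sorted by code point.

    Hypothetical helper: splits on whitespace, like the original script.
    """
    tokens = [w for w in text.split() if URDU_WORD.search(w)]
    return sorted(set(tokens))


def write_words(words, stream):
    """Write one word per line as UTF-8 bytes to a binary stream.

    Writing bytes avoids depending on the locale's output encoding,
    which is what caused the UnicodeEncodeError when piping.
    """
    for word in words:
        stream.write((word + '\n').encode('utf-8'))


if __name__ == '__main__':
    test = "جنگ بندی  اس کروانا سات  اس سات"
    write_words(unique_sorted_urdu_words(test), sys.stdout.buffer)
```

With this version the `| sort` step is optional, since the output is already deduplicated and sorted; piping it through `sort -u` should still work if you prefer to let the shell handle it.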
