Applying this question to urdu using this information it would seem that this is the solution:
Indian-arabic digit codepoints: U+0660 - U+0669
Arabic letter codepoints: U+0600 - U+06FF
In python3 this is really easy:
Use this expression:
r'[\u0600-\u06ff]'
Example:
>>> test="working جنگ test بندی کروانا not good"
>>> test
'working جنگ test بندی کروانا not good'
>>> import re
>>> re.findall(r'[\u0600-\u06ff]',test)
['ج', 'ن', 'گ', 'ب', 'ن', 'د', 'ی', 'ک', 'ر', 'و', 'ا', 'ن', 'ا']
By adding a +
once-or-more operator, you can get complete words.
>>> re.findall(r'[\u0600-\u06ff]+',test)
['جنگ', 'بندی', 'کروانا']
Update for python 2.7 working
In python 2.x unicode is difficult. you have to prefix the regex with ru
to mark it as unicode then it will find the correct glyphs. Also your first line in the script should be
`# -*- coding: utf-8 -*-`
test=u"working جنگ test بندی کروانا not good"
myurdu="".join([unicode(letter) for letter in re.findall(ur'[\u0600-\u06ff]',test)])
print myurdu
>>>
جنگبندیکروانا
For more information I refer you to declaring an encoding and unicode support in python. Consider switching to python3 because unicode support is mmuch better there if you are going to handle a lot of urdu.