0

I have lot of files that contains urdu and english. I have to search for those words that are in urdu only. For english, I know its not a problem by using regular expression. r'[a-zA-Z]' But How I can use regular expression for urdu language.

Suppose this is the string

test="working جنگ test  بندی کروانا not good"

Please guide.

Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
  • 7
    http://stackoverflow.com/questions/14389/regex-and-unicode – vaultah Nov 11 '15 at 09:44
  • 2
    "Please solve it and guide?" . Do you also want a coffee? – Colonel Beauvel Nov 11 '15 at 09:48
  • 1
    Since not every StackOverflow user know the alphabet of Urdu language, you should at least specify the list of characters you want to consider to be part of an Urdu word, plus possible accent (if applicable). – nhahtdh Nov 11 '15 at 09:54

2 Answers2

8

Applying this question to urdu using this information it would seem that this is the solution:

Indian-arabic digit codepoints: U+0660 - U+0669

Arabic letter codepoints: U+0600 - U+06FF

In python3 this is really easy:

Use this expression:

r'[\u0600-\u06ff]'

Example:

>>> test="working جنگ test  بندی کروانا not good"
>>> test
'working جنگ test  بندی کروانا not good'
>>> import re
>>> re.findall(r'[\u0600-\u06ff]',test)
['ج', 'ن', 'گ', 'ب', 'ن', 'د', 'ی', 'ک', 'ر', 'و', 'ا', 'ن', 'ا']

By adding a + once-or-more operator, you can get complete words.

>>> re.findall(r'[\u0600-\u06ff]+',test)
['جنگ', 'بندی', 'کروانا']

Update for python 2.7 working

In python 2.x unicode is difficult. you have to prefix the regex with ru to mark it as unicode then it will find the correct glyphs. Also your first line in the script should be

`# -*- coding: utf-8 -*-`
test=u"working جنگ test  بندی کروانا not good"
myurdu="".join([unicode(letter) for letter in re.findall(ur'[\u0600-\u06ff]',test)])
print myurdu
>>> 
جنگبندیکروانا

For more information I refer you to declaring an encoding and unicode support in python. Consider switching to python3 because unicode support is mmuch better there if you are going to handle a lot of urdu.

Community
  • 1
  • 1
Sebastian Wozny
  • 16,943
  • 7
  • 52
  • 69
0

Another Style of solving your problem

import re    
test=u"working جنگ test  بندی کروانا not good"
token=test.split(' ')
for w in token:
  status=re.search(ur'[\u0600-\u06ff]+',w)
  if status:
      print w

It should work for python version 2.7

Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121