How to find urdu words in a string using python

Question

I have lot of files that contains urdu and english. I have to search for those words that are in urdu only. For english, I know its not a problem by using regular expression. r'[a-zA-Z]' But How I can use regular expression for urdu language.

Suppose this is the string

test="working جنگ test  بندی کروانا not good"

Please guide.

Since not every StackOverflow user know the alphabet of Urdu language, you should at least specify the list of characters you want to consider to be part of an Urdu word, plus possible accent (if applicable). — nhahtdh, Nov 11 '15 at 09:54

score 8 · Accepted Answer · edited Jun 20 '20 at 09:12

8

Applying this question to urdu using this information it would seem that this is the solution:

Indian-arabic digit codepoints: U+0660 - U+0669

Arabic letter codepoints: U+0600 - U+06FF

In python3 this is really easy:

Use this expression:

r'[\u0600-\u06ff]'

Example:

>>> test="working جنگ test  بندی کروانا not good"
>>> test
'working جنگ test  بندی کروانا not good'
>>> import re
>>> re.findall(r'[\u0600-\u06ff]',test)
['ج', 'ن', 'گ', 'ب', 'ن', 'د', 'ی', 'ک', 'ر', 'و', 'ا', 'ن', 'ا']

By adding a + once-or-more operator, you can get complete words.

>>> re.findall(r'[\u0600-\u06ff]+',test)
['جنگ', 'بندی', 'کروانا']

Update for python 2.7 working

In python 2.x unicode is difficult. you have to prefix the regex with ru to mark it as unicode then it will find the correct glyphs. Also your first line in the script should be

`# -*- coding: utf-8 -*-`
test=u"working جنگ test  بندی کروانا not good"
myurdu="".join([unicode(letter) for letter in re.findall(ur'[\u0600-\u06ff]',test)])
print myurdu
>>> 
جنگبندیکروانا

For more information I refer you to declaring an encoding and unicode support in python. Consider switching to python3 because unicode support is mmuch better there if you are going to handle a lot of urdu.

edited Jun 20 '20 at 09:12

Community

1
1

answered Nov 11 '15 at 09:51

Sebastian Wozny

16,943
7
52
69

1

Isn't `[\u0660-\u0669]` already included in `[\u0600-\u06ff]`? – Mariano Nov 11 '15 at 09:57
1

I folowed above but it is printing only english word and none urdu ? – Hafiz Muhammad Shafiq Nov 11 '15 at 10:09
1

Yes I tried both but it is showing only english. my python version is 2.7.6 – Hafiz Muhammad Shafiq Nov 11 '15 at 10:14
This is the result of print re.findall(r'[\u0600-\u06ff]',test) ['o', 'r', 'k', 'i', 'n', 'g', 't', 'e', 's', 't', 'n', 'o', 't', 'g', 'o', 'o', 'd'] – Hafiz Muhammad Shafiq Nov 11 '15 at 10:25
1

No sir, Its difficult as I have to change many other things also – Hafiz Muhammad Shafiq Nov 11 '15 at 10:39
2

@Shafiq pass the pattern as unicode as well: `ur'[\u0600-\u06ff]+'` – Mariano Nov 11 '15 at 10:44

Hafiz Muhammad Shafiq · Answer 2 · 2015-11-12T03:03:45.393

0

Another Style of solving your problem

import re    
test=u"working جنگ test  بندی کروانا not good"
token=test.split(' ')
for w in token:
  status=re.search(ur'[\u0600-\u06ff]+',w)
  if status:
      print w

It should work for python version 2.7

edited Nov 12 '15 at 03:03

answered Nov 11 '15 at 12:58

Hafiz Muhammad Shafiq

8,168
12
63
121

How to find urdu words in a string using python

2 Answers2