I need to extract specific names in Arabic/Persian (similar to proper nouns in English) using Python's re library.

Example (the word "شرکت" means "company", and we want to extract the company's name):

input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس

I've seen [this answer], and replacing "university" with "شرکت" in that example would work for me, but I don't understand how to match keywords by regex against Arabic Unicode text when a call like this fails:

re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None

1 Answer

In Python 2, a plain string literal is a byte string, not a Unicode string. This applies both to non-ASCII letters pasted into the code and to \u escapes. You have to mark the literals as Unicode explicitly:

re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")

Otherwise, the Arabic literal is stored as its encoded bytes, which are different from the Unicode code points, and the string on the right keeps its literal backslashes, since Python 2 does not recognize \u as an escape in byte strings.
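To make the mismatch concrete, here is a small sketch that imitates what Python 2 sees on each side of the failing call. It assumes the source file is saved as UTF-8; the variable names are only illustrative, and the raw string stands in for a Python 2 byte literal containing \u sequences.

```python
# -*- coding: utf-8 -*-
import re

# The keyword as a real Unicode string: four code points.
word = u"\u0634\u0631\u06a9\u062a"

# Roughly what a pasted Arabic byte literal holds in Python 2
# (assuming a UTF-8 source file): two bytes per letter here.
utf8_bytes = word.encode("utf-8")

# A raw string keeps the backslashes literal, which is how a
# Python 2 byte string treats "\u...." sequences.
escaped = r"\u0634\u0631\u06A9\u062A"

print(len(word))        # 4
print(len(utf8_bytes))  # 8
print(len(escaped))     # 24

# With both sides as Unicode strings, the match succeeds.
matched = re.match(word, u"\u0634\u0631\u06a9\u062a") is not None
print(matched)          # True
```

The three lengths differ because the three values are different sequences, so a pattern built from one cannot match text built from another.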

Another option is to import from __future__. In Python 3 every string literal is Unicode by default, which makes the u"..." prefix unnecessary, and the following import gives Python 2 the same behavior:

from __future__ import unicode_literals

This makes plain string literals parse as Unicode, with no u"" prefix needed.
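Applied to the example in the question, a minimal sketch could look like the following. The helper name extract_company_name and the pattern (keyword, whitespace, then the rest of the line) are assumptions made for illustration; real input may need a stricter rule for where the name ends.

```python
# -*- coding: utf-8 -*-
import re

# The keyword meaning "company", as a Unicode literal.
KEYWORD = u"شرکت"

# Hypothetical pattern: the keyword, some whitespace, then capture
# everything up to the end of the line as the company name.
COMPANY_RE = re.compile(re.escape(KEYWORD) + u"\\s+(.+)", re.UNICODE)

def extract_company_name(text):
    """Return the text following the keyword, or None if it is absent."""
    match = COMPANY_RE.search(text)
    return match.group(1).strip() if match else None

name = extract_company_name(u"شرکت تست گستران خلیج فارس")
print(name)
```

The u prefix is kept so the same code runs on Python 2 and on Python 3.3 or later; under Python 2 the input must also be a Unicode string, for example decoded from UTF-8 bytes before matching.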

kabanus