
I have a list of IDs that look like this:

A5ukur+de2.008x١٥١١٠٦١١٥٠٥٢٤٦٢

and I have written the following code, which uses a named group:

>>> import re
>>> RE_SID = re.compile(ur'(?P<sid>(?<=sid:)([A-Za-z0-9+.\u0627-\u064a]+))', re.UNICODE)
>>> x = RE_SID.search('sid:A5ukur+de2.008x١٥١١٠٦١١٥٠٥٢٤٦٢">>')
>>> x.group('sid')
'A5ukur+de2.008x'

However, this does not work when Persian/Arabic characters are combined with Latin characters: it returns only A5ukur+de2.008x.
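
One way to see why the match stops early is to list the characters that the class `[A-Za-z0-9+.\u0627-\u064a]` cannot match. The sketch below is only a diagnostic aid; the helper name `outside_pattern_range` is hypothetical and not part of the original code. It assumes the trailing characters of the ID appear exactly as pasted above. They look like Arabic-Indic digits, and that digit block (U+0660 to U+0669) lies outside the letter range U+0627 to U+064A.

```python
# -*- coding: utf-8 -*-
import unicodedata

# The sample id from the question, as a Unicode string.
sid = u"A5ukur+de2.008x١٥١١٠٦١١٥٠٥٢٤٦٢"


def outside_pattern_range(text):
    """Return (char, code point, name) for each non-ASCII character that is
    not covered by the U+0627..U+064A range used in the pattern."""
    return [
        (ch, u"U+%04X" % ord(ch), unicodedata.name(ch, u"UNKNOWN"))
        for ch in text
        if ord(ch) > 0x7F and not (0x0627 <= ord(ch) <= 0x064A)
    ]


# Print one line per character the pattern would reject.
for ch, code_point, name in outside_pattern_range(sid):
    print(u"%s %s" % (code_point, name))
```

If every trailing character is reported as a digit rather than a letter, the range in the pattern is too narrow for this data, whatever flags are passed to `re.compile`.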

I would appreciate any help.

pm1359
  • What exactly is the pattern? – Maroun Feb 23 '16 at 12:58
  • I believe this question/answer will help you: http://stackoverflow.com/questions/10546442/python-regular-expression-with-utf8-issue - basically use `re.compile(ur'...', re.UNICODE)` – udondan Feb 23 '16 at 12:59
  • See http://stackoverflow.com/questions/27685984/using-range-in-regex-for-arabic-letters. Is it Python 2 or 3? – Wiktor Stribiżew Feb 23 '16 at 13:00
  • There can be some issues with Arabic, as its Unicode blocks are not totally contiguous. You may also want to look at the Unicode definitions, as some of the letters are very locale-specific. – Tim Seed Feb 23 '16 at 13:04
  • I have used `\u0627-\u064a` inside my pattern, like `([A-Za-z0-9\u0627-\u064a._+]*)`, but it doesn't work. In my case, the ID is a combination of Persian and Latin letters. The version is 2.7.10. @Wiktor Stribiżew – pm1359 Feb 23 '16 at 14:43
  • If you use it with `re.U` option, everything will work. Also, just `.encode/decode("utf8")` every time you access the data. – Wiktor Stribiżew Feb 23 '16 at 14:50
  • Would you please write out the correct statement? @WiktorStribiżew – pm1359 Feb 23 '16 at 15:21
  • See [this answer of mine](http://stackoverflow.com/questions/33127900/can-the-a-za-z-python-regex-pattern-be-made-to-match-and-replace-non-ascii-uni/33128359#33128359), and [a similar one here](http://stackoverflow.com/questions/32863608/regex-python-with-unicode-japanese-character-issue/32868484#32868484). [Another one](http://stackoverflow.com/questions/34672071/python-2-re-sub-issue/34672544#34672544), a bit more complex. [Unicode with `re.finditer`](http://stackoverflow.com/questions/32207449/discover-identically-adjacent-strings-with-regex-and-python/32207506#32207506) here. – Wiktor Stribiżew Feb 23 '16 at 15:22
  • @WiktorStribiżew I have tried it and updated my question, but I am not sure how to make it work. – pm1359 Feb 23 '16 at 15:54
  • I checked: the regex just does not catch these symbols. See [this demo](https://regex101.com/r/yE3sI5/1). What are these characters? – Wiktor Stribiżew Feb 23 '16 at 16:22
  • Use `(?P<sid>(?<=sid:)([A-Za-z0-9+.\u0621-\u06FF]+))`. Check [this IDEONE demo](https://ideone.com/KSEkhk). `\u06FF` is *Arabic Letter Heh with inverted V*. See [this answer of mine](http://stackoverflow.com/a/35581928/3832970). Please check what range you are interested in from that table. – Wiktor Stribiżew Feb 23 '16 at 16:32
  • What is the `u` before `test_str = u"sid:A5ukur+de2.008x١٥١١٠٦١١٥٠٥٢٤٦٢\">>'"`? Do I need to use `u`? I am extracting the ID from a log, so it seems to me that I have to prepend `u` to it, right? @WiktorStribiżew – pm1359 Feb 23 '16 at 16:50
  • If you extract the string from somewhere, it must also be Unicode. The `u` prefix marks a Unicode string literal. – Wiktor Stribiżew Feb 23 '16 at 17:07
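
The two points from the comments (widen the range to `\u0621-\u06FF`, and make sure the subject is Unicode as well as the pattern) can be put together in a minimal sketch. It makes several assumptions. The log is assumed to be UTF-8 encoded. The function name `extract_sid` and its `encoding` parameter are hypothetical. The pattern is written as a plain `u"..."` literal instead of `ur'...'` so that the same code runs on Python 2.7 and Python 3.3+. Note that U+0621 to U+06FF is a broad range that also takes in some Arabic punctuation and marks, so it may need narrowing for stricter validation.

```python
# -*- coding: utf-8 -*-
import re

# Widened range suggested in the comments: U+0621..U+06FF covers Arabic and
# Persian letters plus both digit sets (U+0660-U+0669 and U+06F0-U+06F9).
RE_SID = re.compile(u"(?<=sid:)(?P<sid>[A-Za-z0-9+.\u0621-\u06FF]+)", re.UNICODE)


def extract_sid(line, encoding="utf-8"):
    """Return the id that follows 'sid:' in a log line, or None if absent.

    Byte strings (e.g. lines read from a log file in binary mode, or any
    Python 2 `str`) are decoded first, so that pattern and subject are
    both Unicode. The UTF-8 default is an assumption about the log.
    """
    if isinstance(line, bytes):
        line = line.decode(encoding)
    match = RE_SID.search(line)
    return match.group("sid") if match else None


# Simulate a raw log line as bytes, the way it would come out of a file.
raw = u'sid:A5ukur+de2.008x١٥١١٠٦١١٥٠٥٢٤٦٢">>'.encode("utf-8")
print(extract_sid(raw))
```

Decoding once at the boundary, where the data is read, avoids having to put `u` in front of values extracted at runtime: the prefix only applies to literals written in source code.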

0 Answers