How to find Unicode Pattern using Regex in Python3.7?

Question

I am trying to match a Unicode pattern, but re.findall always returns an empty list []. The same pattern works fine in KWrite.

I have tried \u and \\u in place of \w, but neither worked for me. The Unicode part of the input can be any Unicode string.

import re

InputString = r"[[ਅਤੇ\CC_CCD]]_CCP"

Result = re.findall(r'[\[]+[\w]+\\\w+[\]]+[_]\w+',InputString,flags=re.U)

print(Result)

Gurmanjot Singh · Accepted Answer · 2019-01-12T07:02:19.043

There is an extra character, ੇ, between ਤ and \ that \w+ cannot match, because it is a combining vowel sign rather than a letter. Its code point is U+0A47, so I have added [\u0A47] to the regex.
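To see why \w+ stops short, it can help to inspect each character of the word. The following is only a small sketch: the variable names are made up for illustration, and the word is written with escapes on the assumption that it consists of U+0A05, U+0A24 and U+0A47.

```python
import re
import unicodedata

# The Gurmukhi word from the question, written with escapes
# (assumed to be U+0A05, U+0A24, U+0A47).
word = "\u0A05\u0A24\u0A47"

# For each character, record its code point, its Unicode general
# category, and whether Python's \w matches it on its own.
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), re.fullmatch(r"\w", ch) is not None)
    for ch in word
]

for code_point, category, is_word_char in report:
    print(code_point, category, is_word_char)
```

The two letters are reported as word characters, while the vowel sign U+0A47 falls in a mark category and is not matched by \w.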

Try this Regex:

\[+\w+[\u0A47]\\\w+]]\w+

Explanation:

\[+ - matches 1+ occurrences of [
\w+ - matches 1+ occurrences of a word character
[\u0A47] - matches the character ੇ (U+0A47)
\\ - matches \
\w+ - matches 1+ occurrences of a word character
]] - matches ]]
\w+ - matches 1+ occurrences of a word character

Python code
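A minimal sketch of how the regex above might be applied from Python, using the input string from the question. The variable names are illustrative only.

```python
import re

# Input string from the question (raw string, so the backslash is literal).
input_string = r"[[ਅਤੇ\CC_CCD]]_CCP"

# \w+ matches the Gurmukhi letters, [\u0A47] matches the vowel sign
# that \w does not cover, and \\ matches the literal backslash.
pattern = re.compile(r"\[+\w+[\u0A47]\\\w+]]\w+")

result = pattern.findall(input_string)
print(result)
```

Note that this pattern only handles words in which a single U+0A47 sign comes directly before the backslash.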

The words are written in the Gurmukhi script, whose Unicode block is U+0A00 to U+0A7F. So you can also use this regex, which matches any character in that block:

\[+[\u0A00-\u0A7F]+\\\w+]]\w+
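As a rough illustration of the range-based regex, the sketch below runs it over a string containing more than one tagged token. The sample text is hypothetical: it simply repeats the token from the question, and the names are made up.

```python
import re

# Any run of characters from the Gurmukhi block, followed by a
# backslash-introduced tag, a closing ]] and a trailing label.
gurmukhi_tagged = re.compile(r"\[+[\u0A00-\u0A7F]+\\\w+]]\w+")

# Hypothetical input: the question's token repeated twice.
token = r"[[ਅਤੇ\CC_CCD]]_CCP"
text = token + " " + token

matches = gurmukhi_tagged.findall(text)
print(matches)
```

Because the character class covers the whole block, letters and vowel signs are matched alike, and findall returns every occurrence in the text.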

edited Jan 12 '19 at 07:02

answered Jan 12 '19 at 06:24

Gurmanjot Singh


This works. Can you explain why a '.' is needed after \w? Also, if a string contains multiple patterns, it returns only the first. Link (https://regex101.com/r/vv2Qzl/4) – UMR Jan 12 '19 at 07:00
@UMR See the full updated answer for the explanation. – Gurmanjot Singh Jan 12 '19 at 07:01
@UMR See the 2nd regex I have posted in the answer. It will match all the Gurmukhi characters. – Gurmanjot Singh Jan 12 '19 at 07:05
Thanks, the second one worked fine for me. It matched all the patterns in the given text. – UMR Jan 12 '19 at 07:09
