
I am trying to match a pattern that contains Unicode text, but re.findall always returns an empty list []. The same pattern works fine in KWrite.

I have tried \u and \\u in place of \w, but neither worked. The Unicode part of the input can be any Unicode string.

import re

InputString = r"[[ਅਤੇ\CC_CCD]]_CCP"

Result = re.findall(r'[\[]+[\w]+\\\w+[\]]+[_]\w+', InputString, flags=re.U)

print(Result)
UMR

1 Answer


The string contains an extra character, ੇ (U+0A47, the Gurmukhi vowel sign EE), between ਤ and \. It is a combining mark, which Python's \w does not match, so \w+ stops just before it. I have added [\u0A47] to the regex to match it.
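To see which character breaks the match, one option is to test each character of the word against \w and print its Unicode category. This is only a diagnostic sketch; the helper name describe is made up for illustration and is not part of the original code.

```python
import re
import unicodedata


def describe(text):
    """Return (code point, Unicode category, matched-by-\\w) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        category = unicodedata.category(ch)
        is_word_char = re.fullmatch(r"\w", ch) is not None
        rows.append((code_point, category, is_word_char))
    return rows


# The Gurmukhi word from the question's input string.
for row in describe("ਅਤੇ"):
    print(row)
```

The two letters are reported as word characters, while the trailing vowel sign is not, which is why the original pattern never reaches the backslash.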

Try this Regex:

\[+\w+[\u0A47]\\\w+]]\w+


Explanation:

  • \[+ - matches 1+ occurrences of [
  • \w+ - matches 1+ occurrences of a word character
  • [\u0A47] - matches the single character U+0A47 (ੇ)
  • \\ - matches \
  • \w+ - matches 1+ occurrences of a word character
  • ]] - matches ]]
  • \w+ - matches 1+ occurrences of a word character

Python code
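A minimal sketch of how the regex above could be applied to the input from the question, assuming Python 3, where str patterns are Unicode-aware by default and the re module understands \uXXXX escapes. The function name find_tagged is hypothetical.

```python
import re

# The answer's regex: \w+ covers the letters, [\u0A47] covers the vowel sign.
PATTERN = re.compile(r"\[+\w+[\u0A47]\\\w+]]\w+")


def find_tagged(text):
    """Return every substring of text that matches PATTERN."""
    return PATTERN.findall(text)


input_string = r"[[ਅਤੇ\CC_CCD]]_CCP"
result = find_tagged(input_string)
print(result)  # a one-element list containing the whole input string
```

Because the pattern has no capturing groups, findall returns the full matched text rather than group tuples.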

The words are written in the Gurmukhi script, whose Unicode block is U+0A00 to U+0A7F. So you can also use this regex:

\[+[\u0A00-\u0A7F]+\\\w+]]\w+
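The range-based regex can be sketched the same way. The second sample string below is a made-up input (its word and tags are assumptions, not from the question), included only to show that the range also covers words whose combining marks are something other than U+0A47.

```python
import re

# Character class spanning the Gurmukhi block, as in the regex above.
GURMUKHI_PATTERN = re.compile(r"\[+[\u0A00-\u0A7F]+\\\w+]]\w+")


def find_gurmukhi_tagged(text):
    """Return every substring of text that matches GURMUKHI_PATTERN."""
    return GURMUKHI_PATTERN.findall(text)


question_input = r"[[ਅਤੇ\CC_CCD]]_CCP"
# Hypothetical second input, built from escapes so the code points are explicit.
other_input = "[[\u0A2A\u0A70\u0A1C\u0A3E\u0A2C\u0A40" + r"\NN_NNP]]_NP"

for sample in (question_input, other_input):
    print(find_gurmukhi_tagged(sample))
```

Note the trade-off: this version only accepts Gurmukhi text inside the brackets, so input in another script would need its own range.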


Gurmanjot Singh