Python3 unicode regex

Asked Jul 16 '17 at 10:02

Active Jul 16 '17 at 11:04


I'm not a native English speaker, yet somehow I've never written a regex for non-ASCII text in my life, so I'm confused by a seemingly trivial case.

I have a large dictionary scraped from a website by a robot. All HTML tags have been removed. My goal is to remove most carry-over hyphens. The idea is that more than 90% of the problematic hyphens have the form lowercase-lowercase, so they could be caught by a regex like '\p{Ll}-\p{Ll}'. This should also capture Russian lowercase characters, for example при-мер.

However, it seems that \p isn't supported by Python's re engine. I'm not sure which alternative regex engine I should choose, because googling doesn't turn up anything relevant to Python 3. I thought Python 3 was much more advanced when it comes to i18n and Unicode, so I expected it to support Unicode character classes.
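For reference, a minimal standard-library sketch of the lowercase-hyphen-lowercase idea, assuming the hyphen is a plain ASCII '-' with no line break after it. Since re has no \p{Ll}, the pattern only finds a hyphen between two word characters, and the actual "is this Ll?" check is done with unicodedata.category. The function name join_carry_over_hyphens is made up for illustration.

```python
import re
import unicodedata

# A hyphen with a word character on each side. Lookarounds are used so that
# chains such as "a-b-c" are handled hyphen by hyphen.
HYPHEN_BETWEEN_WORD_CHARS = re.compile(r"(?<=\w)-(?=\w)")


def is_lowercase_letter(ch):
    """True if ch is in Unicode general category Ll (what \\p{Ll} would match)."""
    return unicodedata.category(ch) == "Ll"


def join_carry_over_hyphens(text):
    """Drop a hyphen only when both neighbours are lowercase letters (hypothetical helper)."""
    def _replace(match):
        before = match.string[match.start() - 1]  # guaranteed by the lookbehind
        after = match.string[match.end()]         # guaranteed by the lookahead
        if is_lowercase_letter(before) and is_lowercase_letter(after):
            return ""
        return match.group(0)

    return HYPHEN_BETWEEN_WORD_CHARS.sub(_replace, text)


print(join_carry_over_hyphens("при-мер и Санкт-Петербург"))
# → пример и Санкт-Петербург
```

Note that this also joins legitimate lowercase compounds (кто-то would become ктото), which is the remaining share of cases the heuristic is expected to get wrong.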

edited Jul 16 '17 at 11:04

asked Jul 16 '17 at 10:02

Minor Threat


See this: https://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties – anubhava Jul 16 '17 at 10:06
Use the `regex` module instead of `re` – Lucas Trzesniewski Jul 16 '17 at 10:25
So Python 3's re doesn't support Unicode? Is there a reason for that? Compatibility? – Minor Threat Jul 16 '17 at 10:51
Python3 supports unicode better than Python2. https://stackoverflow.com/a/1852463/886607 might be helpful. – Ahmad Yoosofan Jul 17 '17 at 11:09
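
As a rough sketch of the `regex` suggestion from the comments: the third-party `regex` package (installed separately, e.g. with pip) accepts \p{Ll} directly, so the pattern from the question works almost as written. The guarded import and the function name join_hyphens_with_regex are assumptions added for illustration, not part of the package.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # lets this sketch be imported even when the package is missing


def join_hyphens_with_regex(text):
    """Remove a hyphen that sits between two \\p{Ll} characters (hypothetical helper)."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    # Lookarounds keep the letters themselves out of the match,
    # so only the hyphen is replaced.
    return regex.sub(r"(?<=\p{Ll})-(?=\p{Ll})", "", text)
```

With the package installed, join_hyphens_with_regex("при-мер") should give пример, while Санкт-Петербург is left alone because П is uppercase.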
