Unicode Regex with regex not working in Python

Question

I have the following Regex (see it in action in PCRE)

.*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$

However, Python doesn't support Unicode regex with the \p{} syntax. To solve this, I read that I could use the regex module (not the default re), but this doesn't seem to work either, not even with the u flag.

Example:

sentence = "valt nog zoveel zal kunnen zeggen, "

print(re.sub(".*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$","\1",sentence))

Output: < blank >
Expected output: zeggen

This doesn't work with Python 3.4.3.

It works fine using the regex module, use raw string notation. If that doesn't do it, place `(?iV1)` at the beginning of your pattern as well. — hwnd, Aug 16 '15 at 20:21
Are you using raw string notation? `regex.sub(r'...', r'\1', sentence)` — hwnd, Aug 16 '15 at 20:33
@hwnd I wasn't. Now I am and it is working. Thank you! I don't understand why though. What is raw string notation, and should I always use it? — Bram Vanroy, Aug 16 '15 at 20:38
@J.F.Sebastian So basically, with the raw string flag you don't have to "double escape" anything because the engine will recognise the correct special characters? — Bram Vanroy, Aug 17 '15 at 11:10
@BramVanroy: without `r""`, the regex engine won't see the backslash at all: `"\1" == "\x01"` -- a single character (U+0001). — jfs, Aug 17 '15 at 11:17
@J.F.Sebastian Silly question probably, but what's the point of that? — Bram Vanroy, Aug 17 '15 at 11:53
The point is that you should use raw string literals while working with regexes. — jfs, Aug 17 '15 at 11:59
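
The exchange above comes down to what Python does to a string literal before the regex engine ever sees it. Here is a minimal sketch of that effect, assuming only the standard `re` module; since `re` has no `\p{L}`, the pattern below uses `[^\W\d_]` and `[\W\d_]` as stand-ins for `\p{L}` and `\P{L}`, so it is an approximation of the question's pattern rather than the exact one.

```python
import re

sentence = "valt nog zoveel zal kunnen zeggen, "

# In a normal string literal, "\1" is an octal escape: one character, U+0001.
assert "\1" == "\x01" and len("\1") == 1
# In a raw string literal the backslash survives: two characters, \ and 1.
assert len(r"\1") == 2

# Stand-in for the question's pattern (assumption for this sketch):
# [^\W\d_] approximates \p{L}, [\W\d_] approximates \P{L}.
pattern = r".*?[\W\d_]*?([^\W\d_]+-?(?:[^\W\d_]+)?)[\W\d_]*$"

# Replacement is the single control character U+0001, which prints as "blank".
broken = re.sub(pattern, "\1", sentence)
# Replacement is a backreference to capturing group 1.
fixed = re.sub(pattern, r"\1", sentence)

print(repr(broken))
print(repr(fixed))
```

Printing with `repr` makes the invisible U+0001 character visible, which is probably why the original output looked empty.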

Casimir et Hippolyte · Accepted Answer · 2015-08-18T21:58:58.803

3

As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] with the UNICODE flag (even if there are small differences between these two character classes; see the link in the comments).

Second point: your approach is not a good one (if I understand correctly, you are trying to extract the last word of each line), because you have chosen to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words from 10 lines of text is not acceptable (and will crash with more characters). A more efficient way is to find all the last words directly; see this example:
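
As a rough illustration of that replacement, the sketch below uses only the standard `re` module. The sample text and variable names are made up for the example. On Python 3 the UNICODE behaviour is already the default for `str` patterns, so the flag is left out here.

```python
import re

text = "l\u00e9lijk 123 foo_bar"  # "lélijk 123 foo_bar"

# On the Python 3 versions I am aware of, stdlib re rejects \p{L} with an
# error instead of treating it as a Unicode property (hedged: this may
# differ on other versions).
try:
    re.compile(r"\p{L}+")
    supports_p = True
except re.error:
    supports_p = False

# "Word character, but not a digit and not an underscore".
letters = re.compile(r"[^\W\d_]+")
words = letters.findall(text)

print(supports_p, words)
```

The double negation reads as: start from `\w`, then take away `\d` and `_`.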

import re

s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''

p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U) 

words = p.findall(s)

print('\n'.join(words))

Notes:

To obtain the same result with Python 2.7, you only need to add a u before the quotes of the string: s = u'''...
If you absolutely want to limit results to letters, avoiding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use; or, whichever module you choose, a more explicit class with only the needed characters, something like: [A-Za-záéóú...
You can achieve the same with the regex module with this pattern:
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
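
To make the second note concrete, here is a small sketch comparing the `\w` pattern with its letters-only variant, using only the standard `re` module. The second sample line ("Route 66") and the variable names are invented for the comparison.

```python
import re

s = "Ik heb nog nooit een kat gezien zo l\u00e9lijk!\nRoute 66"

# Pattern from the answer: the last "word" may contain digits and underscores.
any_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)", re.M)
# Same pattern with \w replaced by [^\W\d_]: letters only.
letters_only = re.compile(r"^.*\b(?<!-)([^\W\d_]+(?:-[^\W\d_]+)*)", re.M)

last_any = any_word.findall(s)
last_letters = letters_only.findall(s)

print(last_any)      # ['lélijk', '66']
print(last_letters)  # ['lélijk', 'Route']
```

Note that `\b` is still defined in terms of `\w`, so a token such as `foo_bar` would not be split at the underscore by the letters-only variant.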

Other ways:

By line with the re module:

p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])

With the regex module you can take advantage of the reversed search:

p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))

answered Aug 16 '15 at 22:51


I think you misunderstood. I don't have a single file consisting of sentences separated by a newline. In fact, I run the regex on a single sentence at a time. In other words, I have already separated the lines, and then run the regex line by line. – Bram Vanroy Aug 17 '15 at 11:09
@BramVanroy: in this case you can use `search` instead of `findall` without the M modifier (which is then useless), or one of the two other ways, replacing `s.split('\n')` with your list of lines. ~5700 steps for the first sentence of your example is still too high. – Casimir et Hippolyte Aug 17 '15 at 11:24
Okay, thanks. I'm going to try this. Could you explain the regex in the very last example? As I said, I'm not familiar with Python or its regex methods. Especially `(?r)`, `\M` and `regex.U` would be useful to know. About the last but one example: isn't split rather slow? I always thought it'd be slower because you first have to find all individual matches and then make a split in the array? – Bram Vanroy Aug 17 '15 at 11:32
@BramVanroy: `(?r)` is a feature specific to the regex module. It performs the search from the end of the string. `\M` is specific to the regex module too: it is a word boundary that matches only at the end of a word (`\m` matches at the beginning). `regex.U` or `re.U` is the UNICODE flag, which extends character classes like `\w`, `\s`, `\d` to Unicode characters. About the number of steps, see the first pattern: https://regex101.com/r/tU2dO4/1 – Casimir et Hippolyte Aug 17 '15 at 11:37
O, I didn't know you could see the number of steps on regex101.com! That's handy! So now I understand everything that's going on, but when trying this in my code I get the following error: `AttributeError: 'NoneType' object has no attribute 'group'`. I googled some and it's probably to do with the lines containing non-latin characters. [This answer](http://stackoverflow.com/a/17614582/1150683) tells me to use the unicode flag. But I thought we were already using that? – Bram Vanroy Aug 17 '15 at 11:40
@BramVanroy: I don't think it has something to do with non-latin characters, you probably missed something somewhere. – Casimir et Hippolyte Aug 17 '15 at 11:45
In `p.search(line).group(0)`, `line` is the element of which the last word should be extracted, right? – Bram Vanroy Aug 17 '15 at 11:47
@BramVanroy: Yes, `line` is ... the line. – Casimir et Hippolyte Aug 17 '15 at 11:48
See my paste [here](http://pastebin.com/UjeQvNve). As you can see I first extract some context from the line, and from that content `lc` I want to get the last word, but when I do that I get the error I provided earlier. – Bram Vanroy Aug 17 '15 at 11:48
@BramVanroy: could you post somewhere an .lst file to make some test. – Casimir et Hippolyte Aug 17 '15 at 12:34
@BramVanroy: I got it. – Casimir et Hippolyte Aug 17 '15 at 13:29
Okay. I deleted the files. – Bram Vanroy Aug 17 '15 at 17:25
@BramVanroy: you don't need to do anything special; all the characters are Latin characters (not Chinese, Arabic, Japanese...), but since there are accented characters you need to use the U modifier in your patterns and ensure that files are opened as UTF-8 text files. See https://eval.in/417871 – Casimir et Hippolyte Aug 17 '15 at 18:19
Is that file a complete overhaul of my file, or is it simply a showcase? In other words, does it do the exact same thing? Can I use it in production? And what is the advantage of `with` over `for`? – Bram Vanroy Aug 17 '15 at 19:44
@BramVanroy: Try to make it work to see exactly what it does (since I don't know exactly what you are trying to achieve, I can't be sure this is exactly what you need). As you can see, the data is not stored in a collection but only displayed. The goal is to show a way to extract what you need (parts of the filename, parts of lines); perhaps you will need to make changes. About the advantage of `with`, see the manual. – Casimir et Hippolyte Aug 17 '15 at 20:07
Thanks, I tried it out and made changes to my original file. The script does stop when the sentence is only one word. `[lc, pw] = p_first_last_word.search(s.lower()).groups()` throws the error: *AttributeError: 'NoneType' object has no attribute 'groups'*. – Bram Vanroy Aug 18 '15 at 11:42
Also, why do you use `[./\\]` in p_filename? Why not simply `\.`? – Bram Vanroy Aug 18 '15 at 12:19
The opening assertion that `\pL` is exactly equivalent to `[^\W\d_]` is false. There are 9,293 code points in the BMP that have General_Category=Letter. The latter set, `[^\W\d_]` includes 1,342 code points absent from that GC for a total of 10,635 BMP code points. Among these are code points whose General_Category=Mark, General_Category=Letter_Number, and General_Category=Connector_Punctuation. Please study [UTS 18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties) for the details of `\w`: it covers more than you think it does. – tchrist Aug 18 '15 at 13:06
[Annex C of UTS 18](http://unicode.org/reports/tr18/#Compatibility_Properties) defines `\w` to contain `\p{alpha}`, `\p{gc=Mark}`, `\p{digit}`, `\p{gc=Connector_Punctuation}`, and `\p{Join_Control}`. Meanwhile `\p{alpha}` has more than `\p{gc=Letter}` in it. It also contains all the `\p{gc=Letter_Number}` code points (which are `\pN` code points not `\pL` ones) as well as all the circled letter code points in Block=Enclosed_Alphanumerics, which despite being `\p{gc=Other_Symbol}` code points (rather than `\pL` ones) are nonetheless `\p{alphabetic}` ones and consequently `\w` code points. – tchrist Aug 18 '15 at 13:18
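
One way to check this kind of claim against a particular engine is to count what it actually matches. The sketch below does that for the standard `re` module over the BMP, using `unicodedata` for the General_Category. It is only an illustration: Python's `re` defines `\w` through `str.isalnum()` rather than UTS 18 Annex C, and the result depends on the Unicode database shipped with the interpreter, so the counts and extra categories it reports need not match the figures quoted above.

```python
import re
import unicodedata
from collections import Counter

# The class under discussion: \w minus digits minus underscore.
letter_like = re.compile(r"[^\W\d_]")

# Tally the General_Category of every BMP code point the class matches.
counts = Counter(
    unicodedata.category(chr(cp))
    for cp in range(0x10000)
    if letter_like.match(chr(cp))
)

# Categories matched by the class that are not Letter categories (L*).
non_letters = {cat: n for cat, n in counts.items() if not cat.startswith("L")}

print("total matched:", sum(counts.values()))
print("non-letter categories:", non_letters)
```

With the `regex` module, the same loop could compare `\p{L}` directly, which is left out here to keep the sketch free of third-party imports.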
@tchrist Any suggestions, then? – Bram Vanroy Aug 18 '15 at 14:53
@BramVanroy: In this case (only one word), make the first capturing group optional and test if it exists before assigning its match to `lc` (if I remember well). About `[./\\]`, since I didn't use the os package to extract the name of the file, I used this class to split on slash or backslash (for windows) and the point to extract the filename without extension and without the path. – Casimir et Hippolyte Aug 18 '15 at 21:47
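
The `AttributeError: 'NoneType' object has no attribute 'group'` mentioned earlier is what happens whenever `search` finds nothing and its result is used without a check. A minimal sketch of the guard, assuming the standard `re` module: the names `last_word_re` and `last_word` are hypothetical, and the pattern is the last-word pattern from the answer, not the `p_first_last_word` pattern from the linked paste, which is not shown here.

```python
import re

# Last-word pattern from the answer, without re.M since each call gets one line.
last_word_re = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)")

def last_word(line):
    """Return the last word of `line`, or None if the line contains no word."""
    m = last_word_re.search(line)
    # search() returns None when nothing matches, so test before calling group().
    return m.group(1) if m else None

print(last_word("valt nog zoveel zal kunnen zeggen, "))
print(last_word("..."))
```

The caller can then decide what a missing word means (skip the line, log it, use a default) instead of crashing mid-run.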
@tchrist: Good to know, very interesting document. It remains to be seen how the different regex engines/implementations follow these descriptions. – Casimir et Hippolyte Aug 18 '15 at 21:53
@CasimiretHippolyte Well, you either conform to the spec or you do not. There are formal stipulations for what that means. RL1.2a is for the compatibility forms from Annex C, for example. I believe Matt’s library conforms to Level 1 as well as some elements from Level 2. – tchrist Aug 18 '15 at 21:56

score -1 · Answer 2 · edited May 23 '17 at 11:53

This post explains how to use Unicode properties in Python:

Python regex matching Unicode properties

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

answered Aug 16 '15 at 19:49

melwil

[`regex` module](http://stackoverflow.com/a/4316097/4279) should be used instead. – jfs Aug 16 '15 at 23:10
@J.F.Sebastian Yes, you always want to use Matt’s `regex` module in doing any Unicode regular expressions in Python for any number of reasons, one of those being that he closely follows the published standard for these matters, [Unicode Technical Standard #18 Unicode Regular Expressions](http://unicode.org/reports/tr18). See my comments on the other answer about just how tricky things can be when you don’t even have [Level 1 Conformance: Basic Unicode Support](http://unicode.org/reports/tr18/#Basic_Unicode_Support), because without the actual properties you simply cannot get at what you need. – tchrist Aug 18 '15 at 13:26
