Regex Matching Unicode Characters acting oddly with different strings

Question

I am doing a Unicode regex match on some strings.

These are the strings in question. Not two separate lines, but two separate strings.

\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director

\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2

I am using this regex to parse out the titles surrounded by Unicode quotes.

regex = re.compile("\\u2018[^(?!\\u2018$)]*\\u2019",re.UNICODE)

Using regex.findall() on each string returns

['u2018Mama\\u2019']

and

['u2018Glee\\u2019', 'u2018Arrow\\u2019']

This brings up two questions that I couldn't figure out. First, why isn't it returning \u2018? Where is the initial \?

Secondly, why does the first string yield only one match while the second yields two? I can't see what is different. Finally, I replaced \u2018 and \u2019 with ' and then used this regex:

re.compile("'[^']*'")

It matches both titles in both strings. What is the difference here? What am I missing in the Unicode regex?

Thank you in advance.

Are you using Python 2, or 3? (This affects how string literals are parsed.) Is the first character of the input "‘", or "\\"? (That is, are you showing us the repr of strings that print with curly quotes, or do they actually contain backslashes?) The missing backslash problem could be the pattern string containing \u, matching the letter u. — deltab, Sep 14 '13 at 05:54
Are you trying to match `Reboot May Get` and `Director` for the first string then `Star Grant Gustin to Play The Flash` and `Season 2` in the second string? — smac89, Sep 14 '13 at 06:13
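
To illustrate deltab's point about \u matching the letter u, here is a rough sketch. It assumes the input really contains literal backslashes, and it writes out by hand the pattern the regex engine would effectively see if \u is reduced to a plain u. It is written for Python 3, where a raw string keeps \u2018 as six separate characters. The names `effective` and `fixed` are only illustrative, and the second pattern is just one possible fix for this kind of input.

```python
import re

# Assumption: the inputs contain a literal backslash followed by "u2018",
# not the curly-quote characters themselves.
s1 = r"\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director"
s2 = r"\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2"

# What the original pattern amounts to when "\u" is read as a plain "u".
# [^(?!u2018$)] is a character class, not a lookahead: it excludes each of
# the characters ( ? ! u 2 0 1 8 $ ) individually.
effective = re.compile(r"u2018[^(?!u2018$)]*u2019")

result1 = effective.findall(s1)  # "Mummy" contains "u", so it cannot match
result2 = effective.findall(s2)

# One possible fix for this kind of input: match the backslash explicitly
# and capture everything up to the next backslash.
fixed = re.compile(r"\\u2018([^\\]*)\\u2019")
titles1 = fixed.findall(s1)
titles2 = fixed.findall(s2)

print(result1, result2)
print(titles1, titles2)
```

Under these assumptions the first pattern reproduces both symptoms from the question: the leading backslash is never part of the match, and "Mummy" is skipped because the character class excludes the letter u.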

thinker3 · Accepted Answer · 2013-09-14T06:25:33.567


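#coding=utf8
# Python 2 code. The coding line above must be on the first or second line
# of the file, because the pattern below contains non-ASCII characters.

import re

s=u'''\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director
\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2'''
print s
# Note: [^(?!‘$)] is a character class, not a lookahead; it excludes each of
# the characters ( ? ! ‘ $ ) individually.
regex = re.compile(ur"‘[^(?!‘$)]*’",re.UNICODE)
m = regex.findall(s)
print m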

[u'\u2018Mummy\u2019', u'\u2018Mama\u2019', u'\u2018Glee\u2019', u'\u2018Arrow\u2019']

edited Sep 14 '13 at 06:25

answered Sep 14 '13 at 06:18

thinker3


I get a syntax error here, apparently because of the Unicode characters. I tried using this SO question, http://stackoverflow.com/questions/11741574/how-to-set-the-default-encoding-to-utf-8-in-python, to change the default encoding, but still no luck. And I would rather keep it simple so I don't have to worry about encoding problems on different servers etc. – Jeremy Thiesen Sep 14 '13 at 19:09
@Jeremy Thiesen: did you forget #coding=utf8? It is important, not just a comment. – thinker3 Sep 14 '13 at 19:52
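
If the coding declaration is a concern, one option is to keep the source file pure ASCII by writing the quotes as \u2018 and \u2019 escapes in the pattern as well. A minimal sketch, assuming Python 3 (where string literals are Unicode by default) and assuming the input really contains curly quotes rather than backslashes. The name `titles` and the simplified character class are my own choices, not part of the answer above.

```python
import re

# The text with real curly quotes, written with escapes so the file stays ASCII.
s = ("\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director\n"
     "\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2")

# Opening quote, then any run of characters that are not a curly quote,
# then the closing quote.
regex = re.compile("\u2018[^\u2018\u2019]*\u2019")
titles = regex.findall(s)
print(titles)
```

Because the pattern contains only ASCII characters in the source, this variant should not depend on how the file is saved or on a coding line.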
