
I am doing a Unicode regex match on some strings.

These are the strings in question. They are two separate strings, not two lines of one string.

\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director

\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2

I am using this regex to parse out the titles surrounded by Unicode quotes.

regex = re.compile("\\u2018[^(?!\\u2018$)]*\\u2019",re.UNICODE)

Calling regex.findall() on the first string returns

['u2018Mama\\u2019']

and on the second string returns

['u2018Glee\\u2019', 'u2018Arrow\\u2019']

This brings up two questions that I couldn't figure out. First, why isn't it returning \u2018? Where did the initial \ go?

Second, what is different between the two strings? The first returns only one title while the second returns both, and I can't see why. Finally, I replaced \u2018 and \u2019 with ' and then used this regex:

re.compile("'[^']*'")

It matches both titles in both strings. What is the difference here? What am I missing in the Unicode regex?

Thank you in advance.

Jeremy Thiesen
  • Are you using Python 2, or 3? (This affects how string literals are parsed.) Is the first character of the input "‘", or "\\"? (That is, are you showing us the repr of strings that print with curly quotes, or do they actually contain backslashes?) The missing backslash problem could be the pattern string containing \u, matching the letter u. – deltab Sep 14 '13 at 05:54
  • Are you trying to match `Reboot May Get` and `Director` for the first string then `Star Grant Gustin to Play The Flash` and `Season 2` in the second string? – smac89 Sep 14 '13 at 06:13
  • Python 2.7, and trying to get Mummy, Mama, Glee, and Arrow. – Jeremy Thiesen Sep 14 '13 at 19:09
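The output in the question suggests the strings hold a literal backslash followed by `u2018`, not the curly quote character. Python 2.7's `re` module does not interpret `\uXXXX`, so `\u` in the pattern matches a plain `u`, which would explain the missing backslash. The bracketed part is a character class, not a lookahead, and it ends up excluding the letter `u`, which would explain why "Mummy" is skipped. A minimal sketch under that assumption follows; `raw_headline`, `escaped_quote_pattern`, and `escaped_titles` are illustrative names, not from the thread.

```python
import re

# Assumption: the text contains a literal backslash followed by "u2018",
# not the curly quote character itself. A raw string keeps the backslashes.
raw_headline = r"\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director"

# In a raw pattern string, "\\" is the regex for one literal backslash, so
# this matches the text \u2018, a lazily captured title, then the text \u2019.
escaped_quote_pattern = re.compile(r"\\u2018(.*?)\\u2019")

escaped_titles = escaped_quote_pattern.findall(raw_headline)
print(escaped_titles)
```

Because the title is in a capturing group, `findall()` returns only the text between the quote markers.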

1 Answer

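#coding=utf8
# The encoding declaration above must be on the first or second line of the
# file; it lets Python 2 read the non-ASCII quote characters used below.

import re

# A unicode literal, so \u2018 and \u2019 become the real curly quotes.
s = u'''\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director
\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2'''
print s
# The pattern is a unicode string too, so it contains the same characters.
regex = re.compile(ur"‘[^(?!‘$)]*’", re.UNICODE)
m = regex.findall(s)
print m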

[u'\u2018Mummy\u2019', u'\u2018Mama\u2019', u'\u2018Glee\u2019', u'\u2018Arrow\u2019']
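Note that `[^(?!‘$)]` in the pattern above is a character class, not a lookahead: inside brackets, `(`, `?`, `!`, `$` and `)` are literal characters, so the class only excludes those characters and the opening quote. A simpler variant, offered as a sketch and not as the only way to do it, excludes just the closing quote and captures the title. It writes the quotes as escapes in `u""` literals, so the source file stays ASCII and needs no encoding declaration. The names `OPEN_QUOTE`, `CLOSE_QUOTE`, `title_pattern`, `headlines`, and `titles` are hypothetical.

```python
import re

# Escapes inside a u"" literal are resolved by Python itself, so the regex
# engine receives the real quote characters and the file stays pure ASCII.
OPEN_QUOTE = u"\u2018"   # left single quotation mark
CLOSE_QUOTE = u"\u2019"  # right single quotation mark

# One opening quote, a captured run of anything that is not a closing
# quote, then one closing quote.
title_pattern = re.compile(OPEN_QUOTE + u"([^" + CLOSE_QUOTE + u"]*)" + CLOSE_QUOTE)

headlines = [
    u"\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director",
    u"\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2",
]

# findall() returns only the captured group, i.e. the bare titles.
titles = [title_pattern.findall(headline) for headline in headlines]
print(titles)
```

This assumes the input is already decoded text containing the actual curly quote characters, as in the answer's `s`.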

thinker3
  • I get a syntax error here because of the Unicode characters. I tried following this SO question, http://stackoverflow.com/questions/11741574/how-to-set-the-default-encoding-to-utf-8-in-python, to change the default encoding, but still no luck. I would rather keep it simple so I don't have to worry about encoding problems on different servers, etc. – Jeremy Thiesen Sep 14 '13 at 19:09
  • @Jeremy Thiesen: did you forget #coding=utf8? It is important, not just a comment. – thinker3 Sep 14 '13 at 19:52