0

I know from this question that the "nothing to repeat" error in a regular expression is a known Python bug. But I need to compile this Unicode expression

re.compile(u'\U0000002A \U000020E3')

as a single character. It is an emoticon and should be treated as one character. Python interprets this string as u'* \\u20e3' and raises a 'nothing to repeat' error. I have looked around but can't find a solution. Is there a workaround?

emanuele

2 Answers

6

This has little to do with the question you linked. You're not running into a bug. Your regex simply has a special character (a *) that you haven't escaped.

Simply escape the string before compiling it into a regex:

re.compile(re.escape(u'\U0000002A \U000020E3'))
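As a rough illustration of the difference (a minimal sketch in Python 3; the variable names and the sample text are made up for this example), compiling the raw string fails because the leading * is read as a quantifier with nothing before it, while the escaped version matches the sequence literally:

```python
import re

# Same string as in the question: '*', a space, then U+20E3.
keycap = u'\U0000002A \U000020E3'

# Unescaped, the leading '*' is parsed as a quantifier with nothing to repeat.
try:
    re.compile(keycap)
    raw_error = None
except re.error as exc:
    raw_error = str(exc)

# Escaped, every character in the string is matched literally.
pattern = re.compile(re.escape(keycap))
match = pattern.search(u'nice ' + keycap + u' keycap')  # hypothetical sample text

print(raw_error)
print(match is not None)
```

Here the whole sequence is found as one literal match, even though it is made up of several code points.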

Now, I'm a little unsure as to why you're representing * as \U0000002A — perhaps you could clarify what your intent is here?

Thomas Orozco
  • `u'\U0000002A \U000020E3'` is an emoticon. I need to match it in a regular expression as a single character. Escaping it does not work: when I try to compile it, a 'nothing to repeat' error is raised. – emanuele Jan 11 '16 at 10:24
  • Thomas, the OP reads them in from a file (the patterns are "dynamic"). This question seems to be related to the OP's previous posts. – Wiktor Stribiżew Jan 11 '16 at 10:26
  • @emanuele while this character is represented as a single character, it's not technically a single character. That doesn't really matter though -- the issue might be elsewhere. Can you show how you use this regex and on what input? – Thomas Orozco Jan 11 '16 at 10:42
  • @ThomasOrozco Sure. This question is related to this other question: http://stackoverflow.com/questions/34681364/how-to-build-a-regular-vocabulary-of-emoticons-in-python I have a file containing a list of emoticons as ASCII strings. The ASCII strings represent Unicode strings. My code breaks when * is present. – emanuele Jan 11 '16 at 10:48
0

You need to use re.escape (as shown in Thomas Orozco's answer), but apply it only to the dynamic part of the pattern, for example:

import re

# Only the dynamic part is escaped; the hand-written "cool\s*" prefix stays a regex.
print(re.findall(u"cool\\s*%s" % re.escape(u'\U0000002A \U000020E3'),
                 u"cool      * \U000020E3 crazy"))
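The same idea can be extended to a whole vocabulary of emoticons, which seems to be the use case described in the comments. The following is only a sketch in Python 3: the emoticons list, the sample text and the variable names are hypothetical, and in practice the entries would be read from the file.

```python
import re

# Hypothetical vocabulary; in the question these strings come from a file.
emoticons = [u'\U0000002A \U000020E3', u':)', u':-(', u'\U0001F600']

# Escape each entry on its own, then join with '|'.
# Longest entries go first so they are tried before any shorter prefix.
alternatives = sorted(emoticons, key=len, reverse=True)
vocab = re.compile(u'|'.join(re.escape(e) for e in alternatives))

# Hypothetical input text containing all four emoticons.
found = vocab.findall(u'cool * \U000020E3 :) \U0001F600 :-(')
print(found)
```

Escaping each entry before joining keeps the | separators as real regex alternation while characters such as *, ( and ) inside the emoticons are treated literally.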
Yoav Glazner