I'm getting a strange behaviour running this code:

regex.search(ur'([^\p{IsAlnum}\s\.\'\`\,\-])', u'\U0001f618')

This should match \U0001f618, which is the Unicode code point of a kissing emoji. The result, however, is the following:

<regex.Match object; span=(0, 1), match=u'\ud83d'>

This doesn't make sense at all, because u'\ud83d' on its own is not even a valid Unicode character.

I expected this instead:

<regex.Match object; span=(0, 1), match=u'\U0001f618'>

What is happening here?

I'm running Python 2.7.13 on macOS Sierra 10.12.6, regex.__version__ is 2.4.130.

user41951
  • Cannot reproduce. Same python version, same regex version, output is ``. I'm on Manjaro instead of Mac, but not sure how that would make a difference. Maybe try reinstalling the regex module? – Aran-Fey Oct 04 '17 at 10:32
  • Same Python and regex versions, however, on Linux platform. Works as you expect. – mhawke Oct 04 '17 at 10:35
  • Same Python and regex versions, can reproduce on macOS Sierra 10.12.6: `` – jbndlr Oct 04 '17 at 10:41
  • I'm also running macOS Sierra 10.12.6 – user41951 Oct 04 '17 at 11:01
  • This may help: [Python returns length of 2 for single Unicode character string](https://stackoverflow.com/questions/29109944/python-returns-length-of-2-for-single-unicode-character-string) – PM 2Ring Oct 04 '17 at 11:13

1 Answer

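As mentioned by @PM 2Ring, this happens because my Python was compiled as a narrow build (UCS-2) rather than a wide build (UCS-4). A narrow build stores every code point above U+FFFF as a UTF-16 surrogate pair, so u'\U0001f618' is held internally as two code units, u'\ud83d' and u'\ude18'. The regex works on those code units, which is why the match covers only the first one, u'\ud83d'. It would also explain why the commenters on Linux could not reproduce the problem: Python 2 on Linux is typically a wide build.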

More information here: https://stackoverflow.com/a/29109996/4111012
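
To illustrate where u'\ud83d' comes from, here is a minimal sketch of the UTF-16 surrogate arithmetic, together with a check for a narrow build. The helper names `utf16_surrogates` and `is_narrow_build` are made up for this example and are not part of Python or the regex module; the check relies on `sys.maxunicode` being 0xFFFF on narrow builds.

```python
import sys


def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def is_narrow_build():
    """True if this interpreter stores non-BMP characters as surrogate pairs."""
    return sys.maxunicode == 0xFFFF


high, low = utf16_surrogates(0x1F618)
print("%#x %#x" % (high, low))  # 0xd83d 0xde18
print("narrow build: %s" % is_narrow_build())
```

On a narrow build the second line should print True, and len(u'\U0001f618') is 2 instead of 1. On Python 3.3 and later the distinction no longer exists, so it always prints False there.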

user41951