Python 2.7 range regex matching unicode emoticons

Question

How can I count the number of Unicode emoticons in a string using a Python 2.7 regex? I tried the first answer posted for this question, but it raises an invalid expression error.

re.findall(u'[\U0001f600-\U0001f650]', s.decode('utf-8')) does not work; it raises an invalid expression error.

How can I find and count emoticons in a string using Python? For example:

"Thank you for helping out (Emoticon1) Smiley emoticon rocks!(Emoticon2)"

Count : 2

score 0 · Answer 1 · edited May 23 '17 at 11:43

The problem is probably due to using a "narrow build" of Python 2. That is, if you fire up your interpreter, you'll find that sys.maxunicode == 0xffff is True.
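A quick way to check which kind of build you have, as a minimal sketch (the helper name is_narrow_build is made up for illustration):

```python
import sys

def is_narrow_build():
    """Return True when this interpreter stores unicode as 16-bit units."""
    return sys.maxunicode == 0xFFFF

# 0xffff on a narrow build; 0x10ffff on a wide build or on Python 3.3+
print(hex(sys.maxunicode))
print(is_narrow_build())
```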

This site has a few interesting notes on wide builds of Python (which are commonly found on Linux but, contrary to what the link suggests, not on OS X in my experience). Wide builds use UCS-4 internally, so every code point, including the higher ranges you are talking about, is stored as a single unit. Narrow builds use UTF-16 internally, so code points above U+FFFF are stored as "surrogate pairs" of two 16-bit units. I presume this is why you see a bad character range error when you try to compile this regular expression: the regex engine sees the individual surrogates rather than the two endpoints of your range.
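To make the surrogate-pair point concrete, here is a small sketch of the standard UTF-16 arithmetic applied to the two endpoints from the question (the function name to_surrogate_pair is just an illustrative choice):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

for cp in (0x1F600, 0x1F650):
    high, low = to_surrogate_pair(cp)
    print("U+%X -> %04X %04X" % (cp, high, low))
# U+1F600 -> D83D DE00
# U+1F650 -> D83D DE50
```

So on a narrow build the character class most likely reaches the regex engine as \ud83d, then the range \ude00-\ud83d, then \ude50. That middle range runs backwards (0xDE00 is greater than 0xD83D), which would account for the error.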

The only solution I know is to switch, if you can, to a Python version >= 3.3, which no longer has the wide/narrow distinction, or to install a wide Python build.
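If neither of those is an option, one workaround that may be worth trying is to match the surrogate pair explicitly on narrow builds. The sketch below is an assumption-laden illustration, not a tested fix for every setup: it relies on every code point in U+1F600..U+1F650 sharing the high surrogate U+D83D, and the names EMOTICON_RE and count_emoticons are invented here.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: the range can be written directly.
    EMOTICON_RE = re.compile(u'[\U0001F600-\U0001F650]')
else:
    # Narrow build: match the shared high surrogate, then a range of
    # low surrogates (U+DE00..U+DE50 covers U+1F600..U+1F650).
    EMOTICON_RE = re.compile(u'\ud83d[\ude00-\ude50]')

def count_emoticons(text):
    """Count emoticons in a unicode string or in UTF-8 encoded bytes."""
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    return len(EMOTICON_RE.findall(text))

sample = u'Thank you for helping out \U0001F600 Smiley emoticon rocks!\U0001F603'
print(count_emoticons(sample))  # 2
```

The pattern is chosen once at import time from sys.maxunicode, so the same file should behave the same way on narrow and wide interpreters.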
