
I need to remove emojis from some strings using a Python script. I found that someone already asked this question, and the accepted answer said that the following code would do the trick:

#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text) # with emoji

emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji

I inserted this code into my script and changed it only so that it acts on the strings in my code rather than the sample text. When I run the code, though, I get some errors I don't understand:

Traceback (most recent call last):
  File "SCRIPT.py", line 31, in get_tweets
"]+", flags=re.UNICODE)
  File "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework /Versions/2.7/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

I get what the error is saying, but since I grabbed this code from Stack Exchange, I cannot figure out why it apparently worked for the people in that discussion but not for me. I'm using Python 2.7, if that helps. Thank you!

Kirk S.

1 Answer


Your Python build uses surrogate pairs to represent Unicode characters that can't be represented in 16 bits -- it's a so-called "narrow build". This means that any code point at or above u"\U00010000" is stored as two characters. Since the regular expression parser works character by character, even in Unicode mode, this can lead to incorrect behavior when you use characters in that range.

In this particular case, Python only sees the first "half" of the emoji's surrogate pair as the end of the range, and that "half" is less than the start value of the range, which makes the range invalid.
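
For example, on a narrow build the first range in your character class is turned into surrogate pairs before re ever sees it, so the compile below fails the same way your script does (a minimal reproduction, using only the emoticons range):

import re

# On a narrow build, u"\U0001F600" is stored as the pair u"\ud83d\ude00",
# so the class below is what re actually parses: a literal \ud83d, then
# the "range" \ude00-\ud83d, which runs backwards -- hence
# "bad character range".
re.compile(u"[\ud83d\ude00-\ud83d\ude4f]+", flags=re.UNICODE)

You can check which kind of build you have with sys.maxunicode: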

Python 2.7.10 (default, Jun  1 2015, 09:44:56) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> tuple(u"\U00010000")
(u'\ud800', u'\udc00')

Basically, you need to get a "wide build" of Python for this to work (any Python 3.3 or newer behaves this way, since narrow builds were dropped in 3.3):

Python 3.5.2 (default, Jul 28 2016, 21:28:00) 
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> tuple(u"\U00010000")
('',)

The character isn't showing up correctly for me in the browser, but it does show only one character, not two.
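
If upgrading isn't an option and you're stuck on a narrow 2.7 build, one workaround is to pick the pattern based on sys.maxunicode and, on narrow builds, match the surrogate pairs directly -- a sketch, assuming the same four ranges as your original pattern:

import re
import sys

if sys.maxunicode > 0xFFFF:
    # wide build (or Python 3.3+): astral-plane ranges work as written
    emoji_pattern = re.compile(u"["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"]+", flags=re.UNICODE)
else:
    # narrow build: the same ranges, spelled out as surrogate pairs
    emoji_pattern = re.compile(u"(?:"
        u"[\ud83d][\ude00-\ude4f]"   # emoticons
        u"|[\ud83c][\udf00-\udfff]"  # symbols & pictographs, U+1F300-U+1F3FF
        u"|[\ud83d][\udc00-\uddff]"  # symbols & pictographs, U+1F400-U+1F5FF
        u"|[\ud83d][\ude80-\udeff]"  # transport & map symbols
        u"|[\ud83c][\udde0-\uddff]"  # flags (iOS)
        u")+", flags=re.UNICODE)

print(emoji_pattern.sub(u"", u"This dog \U0001f602"))  # prints "This dog "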

agf