
I want to replace every emoji with '' (the empty string), but my regex doesn't work.
For example,

content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'

and I want to replace everything of the form \U0001f633 with '', so I wrote this code:

print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)

But it doesn't work.
Thanks a lot.

Iron Fist
sophiaCY

1 Answer


You won't be able to match properly decoded Unicode code points that way (as strings containing \uXXXX, etc.). Once the literal has been decoded, each escape is a single character by the time the regex engine sees it, so there is no backslash or hex digit left to match.
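A small sketch of why the pattern from the question finds nothing. The names `has_backslash` and `match` are made up for this illustration, and the sample string is a shortened version of the one in the question:

```python
import re

# Shortened sample from the question: two CJK characters, "12", one emoji.
content = u'\u86cb\u767d12\U0001f633'

# The \U0001f633 escape is decoded when the literal is parsed,
# so the resulting string contains no backslash at all.
has_backslash = u'\\' in content
print(has_backslash)  # False

# The question's pattern looks for a literal backslash, 'U', and 8 hex digits.
match = re.search(r'\\U[0-9a-fA-F]{8}', content)
print(match)  # None
```

The pattern would only match text that really contains the ten characters `\U0001f633`, for example data that was never decoded.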

Depending on whether your Python was compiled with only 16-bit Unicode code points (a "narrow" build) or not, you'll want one of the following patterns:

# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

# 32-bit codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

And your code would look like:

import re

# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)

stripped = re_strip.sub('', content)
print(stripped)

With the matching Python build, either expression reduces the stripped string to 26 characters.

These expressions strip out the emoji you were after, but they remove everything above U+FFFF, so they may also strip out other characters you do want. It may be worth reviewing a listing of Unicode code point ranges and narrowing the pattern accordingly.
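As a rough sketch of such an adjustment, the pattern below lists only a few emoji-heavy Unicode blocks instead of every code point above U+FFFF. The choice of blocks is an assumption and not a complete emoji list, `EMOJI_BLOCKS` and `strip_emoji` are hypothetical names, and the code assumes a Python build that handles code points above U+FFFF as single characters:

```python
import re

# Assumed selection of emoji-heavy blocks; NOT a complete list of emoji.
EMOJI_BLOCKS = (
    u'\U0001F300-\U0001F5FF'  # Miscellaneous Symbols and Pictographs
    u'\U0001F600-\U0001F64F'  # Emoticons
    u'\U0001F680-\U0001F6FF'  # Transport and Map Symbols
    u'\U0001F900-\U0001F9FF'  # Supplemental Symbols and Pictographs
)
re_emoji = re.compile(u'[' + EMOJI_BLOCKS + u']')


def strip_emoji(text):
    """Remove characters that fall in the blocks listed above."""
    return re_emoji.sub(u'', text)


print(strip_emoji(u'\u86cb\u767d12\U0001f633\uff0c'))
```

Characters outside the listed blocks, such as Linear B (starting at U+10000), are left alone, which is the trade-off against the broader pattern.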

You can determine whether your python install will only recognize 16-bit codepoints by doing something like:

import sys
print(sys.maxunicode.bit_length())

If this displays 16, you'll need the first expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.

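Neither expression will work on a Python install with the other sys.maxunicode: the surrogate-pair pattern only matches on 16-bit builds, and the \U00010000-\U0010FFFF class only works on wider builds.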
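The check and the two patterns can be combined so the script picks the right one by itself. This is only a sketch; `strip_astral` is a made-up helper name, and the threshold 0xFFFF corresponds to the 16-bit case described above:

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build: each code point above U+FFFF is one character.
    re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
else:
    # Narrow build: such code points are stored as surrogate pairs.
    re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def strip_astral(text):
    """Remove every code point above U+FFFF from text."""
    return re_strip.sub(u'', text)


print(strip_astral(u'a\U0001f633b'))  # ab
```

Only the branch that fits the running interpreter compiles its pattern, so the expression that would fail on this build is never handed to `re.compile`.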

See also: this related question.

Community
jedwards
  • Thank you very much. It works. But you say it may also strip out other things I want. So what if I only want to remove emoji? The content may contain Chinese characters, numbers, letters, punctuation and emoji. BTW, my Python is compiled with only 16-bit Unicode code points. – sophiaCY Jul 31 '16 at 11:40
  • Well, it *may*. The codepoints I'm "filtering" out start with 10000 [here](http://jrgraphix.net/research/unicode_blocks.php). So anything in "Linear B Syllabary" through "Tags". In my experience, most fonts don't even have glyphs for those codepoints. So it's (very) unlikely that anything you *do* want is in that range, so the filtering is probably fine, but it's just something to be aware of. – jedwards Jul 31 '16 at 11:51