Say I have the following HTML emoji entity: '&#x1f604 ;'

Note: there isn't actually a space between the 4 and the ; in the entity. It's only there so that the entity doesn't render as a smiley.

The emoji's Python form is: u"\U0001f604"

How do I convert all HTML emoji entities to their Python form?


Things I have tried so far:

  • Encode to utf-8
  • Unescape the text using HTML Parser and then convert
  • Use a regex (I couldn't find a pattern that worked for all of the HTML emoji entities; it's not as simple as swapping &#x for \U000, since that only works for some entities)
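
For reference, the regex idea in the last bullet can be sketched without any prefix swapping: parse the hex digits as an integer and build the character from that code point, so the number of digits no longer matters. This is only a minimal sketch, assuming Python 3 (where chr accepts any code point up to 0x10FFFF). The names HEX_ENTITY and decode_hex_entities are made up for illustration, and it only handles hexadecimal references, not decimal or named ones.

```python
import re

# Matches hexadecimal numeric character references such as "&#x1f604;".
HEX_ENTITY = re.compile(r"&#[xX]([0-9a-fA-F]+);")


def decode_hex_entities(text):
    """Replace each hex entity with the character it names (hypothetical helper)."""
    # int(..., 16) parses the hex digits; chr() turns the code point into a character.
    # Note: chr() raises ValueError for values above 0x10FFFF.
    return HEX_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_hex_entities("smile: &#x1f604;") == "smile: \U0001f604")
```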
GangstaGraham
  • Possible duplicate: http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string – Robᵩ Mar 04 '16 at 20:16
  • I agree that it is a duplicate. It turns out the solutions on that question did not work for me (I had looked at it before posting this) because Python 2.7.10's HTMLParser seems to be buggy – GangstaGraham Mar 04 '16 at 20:32

1 Answer

HTMLParser.unescape does just that:

In [3]: HTMLParser.HTMLParser().unescape('&#x1f604;')
Out[3]: u'\U0001f604'
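
The session above is Python 2. On Python 3 the same decoding is available as html.unescape in the standard library (HTMLParser.unescape was removed in Python 3.9), so a rough equivalent, assuming Python 3.4 or later, would look like this. The sample string is made up for illustration.

```python
import html

# Sample input (illustrative only): a hex reference, a named entity,
# and the decimal form of the same emoji code point.
text = "Hello &#x1f604; and &amp; and &#128516;"

# html.unescape handles hex, decimal, and named character references.
decoded = html.unescape(text)

print(decoded == "Hello \U0001f604 and & and \U0001f604")
```

Since Python 3 strings hold full code points, the result compares equal to the "\U0001f604" literal directly.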
Robᵩ