
Is there a standard, preferably Pythonic, way to convert the &#xxxx; notation to a proper unicode string?

For example,

&#1502;&#1508;&#1490;&#1513;&#1497;

Should be converted to:

מפגשי

It can be done - quite easily - using string manipulations, but I wonder if there's a standard library for this.

Adam Matan
  • Hint: that notation is called ["numeric character reference"](https://en.wikipedia.org/wiki/Numeric_character_reference). – Joachim Sauer Jun 10 '13 at 07:24
  • Related : http://stackoverflow.com/questions/3894564/replace-numeric-character-references-in-xml-document-using-python – Ashwini Chaudhary Jun 10 '13 at 07:27
  • Possible duplicate of http://stackoverflow.com/questions/663058/html-entity-codes-to-text – Jared Jun 10 '13 at 07:27
  • @AshwiniChaudhary: that one is about a very specific case (UTF-16 codepoints encoded as characters), Jared: that one is about *named* character references (it's possible that the answers still apply, but I don't know). – Joachim Sauer Jun 10 '13 at 07:30

1 Answer


On Python 2, use HTMLParser.HTMLParser().unescape():

>>> from HTMLParser import HTMLParser
>>> h = HTMLParser()
>>> s = "&#1502;&#1508;&#1490;&#1513;&#1497;"
>>> print h.unescape(s)
מפגשי

It's part of the standard library, too.


On Python 3.4 and later, however, use html.unescape() instead. The HTMLParser.unescape() method (imported from html.parser in Python 3) was deprecated in 3.4 and removed in 3.9:

>>> import html
>>> s = '&#1502;&#1508;&#1490;&#1513;&#1497;'
>>> print(html.unescape(s))
מפגשי
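
For comparison with the "string manipulations" route mentioned in the question, here is a minimal Python 3 sketch that decodes only numeric character references and checks the result against html.unescape(). The helper name decode_numeric_refs and its regular expression are illustrative assumptions, not part of the standard library. Unlike html.unescape(), it leaves named references such as &amp; untouched.

```python
import html
import re

# Hypothetical helper (not a stdlib function): handles only numeric
# references, decimal (&#1502;) or hexadecimal (&#x5DE;).
_NUMERIC_REF = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")

def decode_numeric_refs(text):
    def replace(match):
        body = match.group(1)
        if body[0] in "xX":
            code_point = int(body[1:], 16)
        else:
            code_point = int(body)
        # chr() raises ValueError for values above 0x10FFFF;
        # this sketch does not try to recover from that.
        return chr(code_point)
    return _NUMERIC_REF.sub(replace, text)

encoded = "&#1502;&#1508;&#1490;&#1513;&#1497;"
print(html.unescape(encoded))                        # standard library
print(decode_numeric_refs(encoded))                  # manual equivalent
print(decode_numeric_refs("&#x5DE; &amp; &#1502;"))  # &amp; is left as-is
```

In practice html.unescape() is the simpler choice, since it also covers named references and malformed input. The manual version is only useful when named references must stay escaped.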
TerryA