Convert numeric character reference notation to unicode string

Question

Is there a standard, preferably Pythonic, way to convert the &#xxxx; notation to a proper unicode string?

For example,

&#1502;&#1508;&#1490;&#1513;&#1497;

Should be converted to:

מפגשי

It can be done - quite easily - using string manipulations, but I wonder if there's a standard library for this.
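
For reference, the string-manipulation approach could look roughly like the sketch below. The function name `decode_numeric_refs` and the regular expression are illustrative assumptions, not part of any library. It assumes Python 3 (where `chr` accepts any code point) and handles only decimal (`&#1502;`) and hexadecimal (`&#x5DE;`) numeric references, leaving named references such as `&amp;` untouched.

```python
import re

# Matches decimal (&#1502;) and hexadecimal (&#x5DE;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace numeric character references in text with the characters they denote."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference as it was.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


print(decode_numeric_refs("&#1502;&#1508;&#1490;&#1513;&#1497;"))  # מפגשי
```

This is only a hand-rolled illustration of the idea; it does not attempt the error-recovery rules that HTML parsers apply to malformed references.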

Hint: that notation is called ["numeric character reference"](https://en.wikipedia.org/wiki/Numeric_character_reference). — Joachim Sauer, Jun 10 '13 at 07:24
Related : http://stackoverflow.com/questions/3894564/replace-numeric-character-references-in-xml-document-using-python — Ashwini Chaudhary, Jun 10 '13 at 07:27
Possible duplicate of http://stackoverflow.com/questions/663058/html-entity-codes-to-text — Jared, Jun 10 '13 at 07:27
@AshwiniChaudhary: that one is about a very specific case (UTF-16 codepoints encoded as characters), Jared: that one is about *named* character references (it's possible that the answers still apply, but I don't know). — Joachim Sauer, Jun 10 '13 at 07:30

TerryA · Accepted Answer · 2013-06-10T07:39:15.827

>>> from HTMLParser import HTMLParser
>>> h = HTMLParser()
>>> s = "&#1502;&#1508;&#1490;&#1513;&#1497;"
>>> print h.unescape(s)
מפגשי

It's part of the standard library, too.

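However, if you're using Python 3, you have to import from html.parser. Note that HTMLParser.unescape was deprecated in Python 3.4 and removed in Python 3.9, so this only works on older Python 3 versions: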

>>> from html.parser import HTMLParser
>>> h = HTMLParser()
>>> s = '&#1502;&#1508;&#1490;&#1513;&#1497;'
>>> print(h.unescape(s))
מפגשי

answered Jun 10 '13 at 07:32, edited Jun 10 '13 at 07:39 by TerryA

`unescape` appears to be internal and undocumented. Is there an "official" way? – georg Jun 10 '13 at 07:47
@thg435 Not that I know of, sorry – TerryA Jun 10 '13 at 07:50
I haven't found it either. Well, this kinda sucks, doesn't it? – georg Jun 10 '13 at 07:55
Seems there's now an official way since Python 3.4 using [html.unescape(s)](https://docs.python.org/3/library/html.html#html.unescape). – tlwhitec Apr 25 '17 at 10:11
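
A minimal sketch of the `html.unescape` approach mentioned in the comment above, assuming Python 3.4 or later. The variable names `s` and `decoded` are just for illustration.

```python
import html

s = "&#1502;&#1508;&#1490;&#1513;&#1497;"

# html.unescape is the documented function for converting character
# references (numeric and named) to the corresponding Unicode characters.
decoded = html.unescape(s)
print(decoded)  # מפגשי
```

The same call also decodes named references, so `html.unescape("&amp;")` gives `"&"`.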
