
I am using the Readability Parser API to extract content from a web page. It works fine when the page uses Latin characters, but when I extract an article in Cyrillic, I end up with the following:

<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>...etc

The interesting thing is that the title of the web page is extracted correctly in Cyrillic, but the content is not. I tried the following, as suggested in this SO answer:

content = unicodedata.normalize('NFKD', content).encode('ascii','ignore')

but it did not work. Is there a way to convert this string before saving it to the database?

Please let me know if the title of my question explains correctly what I need. Thank you.

nickbusted

1 Answer


One way (Python 3.3):

>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import html.parser
>>> h=html.parser.HTMLParser()
>>> h.unescape(s)
'<div>Ввоскресень</div>'

Python 2.7:

>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import HTMLParser
>>> h=HTMLParser.HTMLParser()
>>> print(h.unescape(s))
<div>Ввоскресень</div>

P.S. I went to look for the documentation link and it looks like unescape isn't documented. Here's a way without using an undocumented API:

>>> import re
>>> re.sub(r'&#x(.*?);',lambda x: chr(int(x.group(1),16)),s)
'<div>Ввоскресень</div>'
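The regex above only handles hexadecimal references, and `.*?` accepts any characters between `&#x` and `;`. A slightly stricter sketch of the same idea for Python 3 is shown here. It also accepts decimal references such as `&#1042;`. The helper name `decode_numeric_refs` is made up for illustration, and named entities such as `&amp;` are deliberately left untouched.

```python
import re

# Matches numeric character references only: hex (&#x412;) or decimal (&#1042;).
_NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def decode_numeric_refs(text):
    """Replace numeric character references with the characters they encode.

    Hypothetical helper, not part of any library.
    """
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the reference as-is.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


s = '<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
print(decode_numeric_refs(s))
```

Restricting the pattern to digits avoids a `ValueError` from `int()` on malformed input such as `&#xZZ;`, which the pattern simply does not match.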

Per the comment, it is finally documented (and moved) in Python 3.4, as html.unescape().
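A minimal sketch of the documented route, assuming Python 3.4 or later, where `html.unescape()` is a module-level function and no parser object is needed:

```python
import html  # html.unescape() is available from Python 3.4 onward

s = '<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'

# Decodes numeric and named character references; the <div> tags are left alone.
decoded = html.unescape(s)
print(decoded)

# Named references such as &amp; are converted as well.
print(html.unescape('Tom &amp; Jerry'))
```

Since this only converts character references and does not strip markup, the result can be stored in the database as ordinary Unicode text that still contains the HTML tags.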

Mark Tolonen
  • Thank you very much, I really appreciate it! I would only add that h.unescape is deprecated (Python 3.5), so I used html.unescape(). – nickbusted Oct 27 '14 at 01:19