I have a string of escaped HTML markup, '&#xed;', and I want to convert it to the correct accented character, 'í'.

Having read around SO, this is my attempt:

messy = '&#xed;'
print type(messy)
>>> <type 'str'>

decoded=messy.decode('utf-8')
print decoded
>>> &#xed;

Drats. After reading here, I tried this:

from BeautifulSoup import *
soup = BeautifulSoup(messy, convertEntities=BeautifulSoup.HTML_ENTITIES)
print soup.contents[0].string
>>> &#xed;

Still not working, so I tested the example from the SO question I linked to previously.

html = '&#196;'
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
print soup.contents[0].string
>>> Ä

This one works. Does anyone see what I am missing?

Community

1 Answer

Use HTMLParser.HTMLParser.unescape:

>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&#xed;')
u'\xed'
>>> print parser.unescape('&#xed;')
í

In Python 3.4 and later, use html.unescape (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):

>>> import html
>>> html.unescape('&#xed;')
'í'
falsetru
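
As a supplementary sketch (not part of the answer above, and assuming Python 3.4 or later), the following shows why the `.decode('utf-8')` attempt in the question left the string unchanged, and how `html.unescape` handles the hexadecimal, decimal, and named forms of the same character. The variable names are illustrative only.

```python
import html

# The entity is plain ASCII text, so it is already valid UTF-8.
messy_bytes = b"&#xed;"

# Decoding bytes only maps bytes to characters; it does not interpret
# HTML entities, so the entity text comes back unchanged.
text = messy_bytes.decode("utf-8")
print(text)

# html.unescape interprets numeric (hex and decimal) and named entities.
entities = ["&#xed;", "&#237;", "&iacute;", "&#196;"]
decoded = [html.unescape(e) for e in entities]

for entity, char in zip(entities, decoded):
    print(entity, "->", char)
```

The first three entities are different spellings of the same code point (U+00ED), so they should all decode to the same character. The last one is the example from the question that already worked with BeautifulSoup.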