
I used BeautifulSoup to handle XML files that I have collected through a REST API.

The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so they can be displayed nicely.

Unfortunately I need the HTML code.
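
For illustration (the markup here is just an example I made up), the text I currently get back looks like the first line below, while what I actually need is the second:

escaped = '&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;'   # what I have now
wanted = '<p>Hello <b>world</b></p>'                            # what I need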


How would I go about transforming the escaped HTML into proper markup?


Help would be very much appreciated!

RadiantHex

2 Answers


I think you want xml.sax.saxutils.unescape from the Python standard library.

E.g.:

>>> from xml.sax import saxutils as su
>>> s = '&lt;foo&gt;bar&lt;/foo&gt;'
>>> su.unescape(s)
'<foo>bar</foo>'
Alex Martelli
  • Unfortunately it doesn't handle all characters. For example, ``su.unescape('&quot;')`` doesn't work. – Filipe Correia Apr 04 '13 at 10:56
  • You can unescape other characters by specifying them in a dictionary as the second parameter to `unescape`. For example: `su.unescape(s, {'&quot;': '"'})` – amccormack Sep 03 '15 at 17:56
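
Building on those comments, here is a minimal sketch (the sample string and the extra-entity dictionary are only illustrative) of handling &quot; via unescape's second parameter, and of Python 3's html.unescape, which knows all the named character references:

>>> from xml.sax import saxutils as su
>>> import html
>>> s = '&lt;p class=&quot;intro&quot;&gt;hi&lt;/p&gt;'
>>> su.unescape(s, {'&quot;': '"'})  # extra entities passed as a dict
'<p class="intro">hi</p>'
>>> html.unescape(s)  # Python 3: covers all named entities, including &quot;
'<p class="intro">hi</p>'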

You could try the urllib module?

It has a method unquote() that might suit your needs.

Edit: on second thought (and after more reading of your question), you might just want to use string.replace()

Like so:

string.replace('&lt;','<')
string.replace('&gt;','>')
Nathan Osman
  • Why would you bother with coding the different replace steps (for lt, gt, amp) when the saxutils.unescape wraps them all up for you?-) Plus, remember: the replace call doesn't alter the string, it builds a new string. Your code snippet, as given, is a slow no-op!-) – Alex Martelli Mar 19 '10 at 20:29
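
Putting those points together, a minimal working sketch of the replace approach (the variable names are only illustrative), with the result kept by chaining and reassigning:

escaped = '&lt;b&gt;hello&lt;/b&gt; &amp; more'
# str.replace() returns a new string, so chain the calls and keep the result;
# replace &amp; last so it cannot create fresh &lt;/&gt; matches
markup = escaped.replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')
print(markup)  # <b>hello</b> & more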