Possible Duplicate:
Decode HTML entities in Python string?

I have a malformed string in Python:

Muhammad Ali&#39;s fight with Larry Holmes

where &#39; is an apostrophe.

Firstly, what representation is this: &#39;? Secondly, how can I parse the string in Python so that it replaces &#39; with '?

Bruce
  • This looks like an HTML entity for the character with code 39 (which would make it easy to parse and reassemble using `chr()`). However, there are also a large number of symbolic HTML entities like `&amp;` (`&`) which you'd probably also want to consider. – Kos Nov 13 '11 at 20:17
  • @All: I did not know how to search for an answer because I did not know what to search. – Bruce Nov 13 '11 at 20:20

2 Answers

5

The Python Standard Library's HTMLParser is able to decode HTML entities in strings.

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'
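
The session above is Python 2. As a minimal sketch for Python 3.4 and later, where the standard library exposes the same decoding as `html.unescape`, the string from the question could be handled like this (the variable names are only illustrative):

```python
import html

# The string from the question, with the numeric entity still encoded.
s = "Muhammad Ali&#39;s fight with Larry Holmes"

# html.unescape handles numeric references (&#39;) and named ones (&copy;).
decoded = html.unescape(s)
print(decoded)
```

This covers both the numeric form and the symbolic entities mentioned in the comments, so no hand-written lookup table is needed.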

A range of solutions are described here: http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/

Acorn
1

The &#CHAR-CODE; form is a syntax for special characters in HTML (maybe elsewhere, but I'm not sure). There may be a more complete way to do this, but you could simply replace it with:

mystring = "Muhammad Ali&#39;s fight with Larry Holmes"
print(mystring.replace("&#39;", "'"))

Yields:

Muhammad Ali's fight with Larry Holmes
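
A single `replace` only handles `&#39;`. As a rough sketch of the `chr()` idea from the comments, assuming Python 3 and only decimal numeric references (named entities such as `&amp;` are left untouched), something like the following could generalize it. The helper name `decode_numeric_entities` is hypothetical:

```python
import re

def decode_numeric_entities(text):
    """Replace decimal references such as &#39; with the character they encode."""
    # chr() raises ValueError for codes above 0x10FFFF; this sketch does not guard against that.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

result = decode_numeric_entities("Muhammad Ali&#39;s fight with Larry Holmes")
print(result)
```

For anything beyond a quick fix, a full entity decoder from the standard library is likely the safer choice.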

Adam Wagner