0

Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python

I am reading an Excel XML document using Python. I end up with a lot of character references such as &#233; that represent various accented letters (and the like). Is there an easy way to convert these to UTF-8?

Neil Aggarwal
  • 1
    You'll need to give more details. Usually it is relatively easy to encode and decode in python, provided you understand what is going on. – Martijn Pieters Dec 18 '12 at 08:11
  • 1
    In particular, are you using Python 2 or 3, do you have byte strings or Unicode strings, and if byte strings what character set are they in? (It also may help to know which module you're using to read/parse the document.) – abarnert Dec 18 '12 at 08:13
  • Thanks Martijn for the quick response. I think the main problem I am facing is that I don't know what encoding this is. I get the sense that it's not really an "encoding" but rather something specific to XML. I don't have much more information: I have a list of names with "encodings" such as the one above all over the place. The names are from various countries, hence the various accented characters. – Neil Aggarwal Dec 18 '12 at 08:13
  • Using Python 2. The string comes in as bytes (it is from an Excel XML file), but I convert it to unicode using .decode("utf-8"), and the character set is UTF-8. – Neil Aggarwal Dec 18 '12 at 08:16
  • Thank you so much. It is a repeat. I searched for a long time for another answer, but didn't come across that one. I am relatively new to coding, so my search terms probably weren't right. Thanks again. – Neil Aggarwal Dec 18 '12 at 08:17
  • 1
    OK, so you have properly-decoded Unicode strings, except that some of the characters are escaped as XML entity references rather than directly available as characters. Depending on how you're doing the XML parsing, you may be able to do it while parsing; otherwise, this definitely looks like a dup of the other question. – abarnert Dec 18 '12 at 08:17
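To illustrate the "while parsing" option mentioned in the comment above: a conforming XML parser resolves numeric character references on its own, so the text it hands back already contains the real characters. This is a minimal Python 3 sketch using the standard library's xml.etree.ElementTree; the `sample` markup and its `row`/`name` element names are made up for illustration and are not the structure of an actual Excel XML file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real Excel XML file has a much richer structure.
sample = "<row><name>Jos&#233; Mu&#241;oz</name></row>"

root = ET.fromstring(sample)

# The parser has already turned &#233; and &#241; into characters.
name = root.find("name").text

# Encode explicitly only when bytes are needed, e.g. for writing to a file.
utf8_bytes = name.encode("utf-8")
```

If the references survive parsing, one likely cause is that the file was read as raw text rather than through an XML parser.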

2 Answers

1

If you just want to convert an HTML entity to its Unicode equivalent:

>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&#233;')
u'\xe9'
>>> print parser.unescape('&#233;')
é

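This is for Python 2.x. In Python 3 the module is html.parser; from Python 3.4 onward the documented function is html.unescape(), and HTMLParser.unescape() was removed in Python 3.9.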
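For readers on Python 3.4 or later, a rough equivalent of the session above uses html.unescape from the standard library. The sample string is invented for illustration; it mixes decimal and named references to show that both forms are handled.

```python
import html

# Invented sample text with decimal (&#233;) and named (&amp;, &euml;) references.
escaped = "Ren&#233;e &amp; Zo&euml;"

# html.unescape returns a str with every recognised reference replaced.
text = html.unescape(escaped)

# UTF-8 is a byte encoding, so encode only at the point where bytes are required.
utf8_bytes = text.encode("utf-8")
```

Unlike the HTMLParser method, this is a documented function, so it should be the safer choice where Python 2 support is not needed.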

Burhan Khalid
  • This is an undocumented function that just happens to be in the CPython implementation of `HTMLParser`—and it doesn't actually work properly until either 2.6/3.0 or 2.7/3.1 (I forget which). So I don't think it's the ideal solution, except for a quick&dirty hack. There are better solutions (along with this one) on the question this is a dup of. – abarnert Dec 18 '12 at 08:30
  • Using the tips from this QandA and the other, I have the following solution which seems to work: – Neil Aggarwal Dec 18 '12 at 18:29
0

Using tips from this Q&A and the other one, I have a solution that seems to work. It takes an entire document and replaces every HTML entity in it with the corresponding Unicode character.

import re
import HTMLParser  # Python 2 module name

parser = HTMLParser.HTMLParser()  # create the parser once, not once per entity
# match only real references: decimal (&#233;), hex (&#xe9;) or named (&eacute;)
entity_re = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# page is the document as a unicode string; every match is unescaped in a single pass
page = entity_re.sub(lambda m: parser.unescape(m.group(0)), page)
Neil Aggarwal
  • One downside of the code as first posted is that if the same HTML entity appears more than once in the page (as it almost always does), it runs the same replace call multiple times. It's an easy fix: remove the repeats from list_of_html, for example with set(), before running the replace loop. – Neil Aggarwal Dec 18 '12 at 18:36
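The de-duplication idea from the comment above can be sketched in Python 3 as follows. The helper name `unescape_entities` and the pattern `ENTITY_RE` are hypothetical names chosen for this example. Each distinct reference is unescaped once into a lookup table, and the document is then rewritten in a single pass, which also avoids unescaping the same text twice (for instance `&amp;eacute;` becoming `é` instead of `&eacute;`).

```python
import html
import re

# Matches decimal (&#233;), hex (&#xe9;) and named (&eacute;) references only,
# so a bare ampersand such as the one in "AT&T" is left alone.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(page):
    """Replace every entity reference in page, unescaping each distinct one once."""
    distinct = set(ENTITY_RE.findall(page))            # removes the repeats
    table = {ref: html.unescape(ref) for ref in distinct}
    # Single pass over the text: replaced text is never scanned again.
    return ENTITY_RE.sub(lambda match: table[match.group(0)], page)


cleaned = unescape_entities("caf&#233; &amp; cr&#232;me, caf&#233;; AT&T")
```

On Python 3, calling html.unescape on the whole document is usually simpler; the table-based version is shown only because it mirrors the approach described in this answer.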