I'm new to Python and am trying to use urllib2/lxml to fetch and parse a page. Everything seems to work fine, except that the parsed page, when opened in my browser, seems to have strange characters embedded in it. I'm guessing this is a Unicode/lxml parsing problem. When I get the text content of an element using .text_content() and print it, I get output like "sometext \342\200\223 moretext". In the original page, this shows as "sometext - moretext".

Could anyone tell me:
1. what's going on?
2. how do I fix it?
3. where can I read up on encoding issues like these?

Thanks!

Toki Tom

2 Answers

What is going on is that the website is using an en dash, which is a slightly longer dash (and really the one you should use in ranges, like 40–56. Yes, dashes are a whole science unto themselves).

In Unicode, the en dash has codepoint U+2013. The numbers you get, \342\200\223, are the octal representation of the UTF-8 encoding of that codepoint. Why you get octal I don't know; I get hex, so on my computer it looks like '\xe2\x80\x93'. But that makes no difference: it's just the representation. The numbers are the same.
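As a quick check of that claim, here is a minimal sketch that encodes U+2013 as UTF-8 and prints each byte in both hex and octal. The variable names are made up for illustration.

```python
en_dash = u"\u2013"                      # the en dash code point
utf8_bytes = en_dash.encode("utf-8")     # its UTF-8 encoding: three bytes

# bytearray yields integers on both Python 2 and Python 3
hex_form = ["%02x" % b for b in bytearray(utf8_bytes)]
octal_form = ["%03o" % b for b in bytearray(utf8_bytes)]

print(hex_form)    # ['e2', '80', '93']
print(octal_form)  # ['342', '200', '223']
```

The same three bytes appear either way; only the notation used to display them differs.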

What you probably should do is decode the HTML string you get to unicode as early as possible. The headers you get back when you fetch the page should tell you what encoding it uses (although it's apparently UTF-8 here). It's fairly easy to extract that from the headers; you'll see it when you print them out.

You then decode the html data:

htmldata = htmldata.decode(encoding)  # encoding: the charset you found in the headers, e.g. 'utf-8'
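A rough sketch of that step, assuming Python 3 and sample values standing in for a real response. `charset_from_content_type` is a hypothetical helper, not part of urllib2 or lxml; it only parses the value of a Content-Type header and falls back to a default when no charset is given.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

# Sample values standing in for a real HTTP response (assumptions for illustration).
header_value = "text/html; charset=UTF-8"
htmldata = b"sometext \xe2\x80\x93 moretext"

encoding = charset_from_content_type(header_value)
text = htmldata.decode(encoding)   # bytes -> unicode text, done once, early
```

With a live response you would not normally need the helper: in Python 2's urllib2, `response.headers.getparam('charset')` should give the same information, and in Python 3's urllib.request it is `response.headers.get_content_charset()`.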
Lennart Regebro
  • Should the data be left as unicode when passing it to other programs? I currently serialize the data using thrift/pb (and it's later read by non-unicode-aware C/C++ programs). What is the best way of handling this? Can one freely convert between, say, ISO-8859-1 and UTF-8? That is, if the C++ programs are ported to be aware, and expect all input in UTF-8, would that be best? Thank you! – Toki Tom Dec 11 '10 at 06:18
  • @Toki Tom: See http://docs.python.org/howto/unicode.html#tips-for-writing-unicode-aware-programs for tips on writing unicode-aware programs. UTF-8 can express all unicode code points (there are over a million of them). See http://en.wikipedia.org/wiki/UTF-8. ISO-8859-1 can express 256 code points. See http://en.wikipedia.org/wiki/ISO/IEC_8859-1. Code points between U+0000 and U+007F (ASCII) map to the same byte values in both UTF-8 and ISO-8859-1. Code points between U+0080 and U+00FF are a single byte in ISO-8859-1 but two bytes in UTF-8, so converting from ISO-8859-1 to UTF-8 means decoding and re-encoding, which always succeeds. But not all UTF-8 can be decoded to unicode and re-encoded as ISO-8859-1. – unutbu Dec 11 '10 at 11:28
  • @Toki Tom: Other "programs", no. You can't leave it as Unicode. Unicode is *not* a way to encode data. When you want to exchange unicode data between one piece of software and another, you need to encode it with an encoding such as UTF-8 or Latin-1. When sending it to other Python functions, then yes, of course you can keep it as Unicode. – Lennart Regebro Dec 11 '10 at 14:18
  • @unutbu, @Lennart: Just to be clear, I should always be able to decode to UTF-8, both from ASCII and ISO-8859, which seem to be the most prevalent encodings. I should always decode from whatever encoding I get it in to unicode when working within Python (using the .decode() function). When serializing the data, I should .encode() to whatever encoding I want to use (UTF-8 seems best). If the program that reads the serialized data doesn't understand UTF-8 (say, only ASCII), then it'll get those code points that are the same in both, but everything else will be gibberish. All correct? – Toki Tom Dec 11 '10 at 20:45
  • Further, since I ran out space above, thanks to both of you :). – Toki Tom Dec 11 '10 at 20:47
  • Yes (except that you miswrote "decode to UTF-8" at first, when you mean encode to UTF-8). – Lennart Regebro Dec 11 '10 at 21:20
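
The rules worked out in these comments can be sketched as follows, assuming Python 3 (where bytes and text are distinct types; in Python 2 the text type is `unicode`). The sample strings and variable names are made up for illustration.

```python
# Decode at the input boundary, work with text, encode at the output boundary.
latin1_bytes = b"caf\xe9"                 # "café" as ISO-8859-1 bytes (sample data)
text = latin1_bytes.decode("iso-8859-1")  # now a text (unicode) string
utf8_bytes = text.encode("utf-8")         # serialize for another program

# ASCII characters keep the same single byte in both encodings...
same_for_ascii = "cafe".encode("iso-8859-1") == "cafe".encode("utf-8")
# ...but U+00E9 is one byte in ISO-8859-1 and two bytes in UTF-8.
same_for_e_acute = latin1_bytes == utf8_bytes

# Not every code point fits in ISO-8859-1: the en dash (U+2013) does not.
try:
    "40\u201356".encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
```

A reader that only understands ASCII would see the ASCII characters correctly and misread the two-byte sequence for U+00E9, which matches the "gibberish" case described above.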
You'll mainly need to be mindful of unicode issues at two points in the process:

  1. Get the response into a unicode string, nicely explained here on SO
  2. Specify a suitable encoding when outputting strings

--

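# from an lxml etree (root is an already-parsed element); returns UTF-8 bytes
from lxml import etree
etree.tostring(root, encoding='utf-8', xml_declaration=False)

# from a unicode string x; returns UTF-8 bytes
x.encode('utf-8')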
Community
Cameron Jordan