Unicode HTML Conversion to ASCII in Python

Question

Possible Duplicate:
Unescaping Characters in a String with Python

I have a string of Unicode HTML in Python which begins with: `\u003ctable>\u003ctr`. I need to convert this to ASCII so I can then parse it with BeautifulSoup. However, Python's encode and decode functions seem to have no effect; I get the original string back no matter what I try. I'm new to Python and Unicode in general, so any help would be much appreciated.

BeautifulSoup can handle Unicode. In fact, it goes to great lengths to make everything unicode, with a class called "UnicodeDammit". — Thomas K, Jul 01 '11 at 17:02
Oh, wait, I think I see what you mean. You've somehow got a byte string including those characters? Try `s.decode("unicode-escape")`. Or if it's in your code, write it as `u"\u003ctable>\u003ctr"`. — Thomas K, Jul 01 '11 at 17:04
The other likely source of a `\u003c` escape is JSON. If you are receiving JSON-encoded input you should be decoding the entire thing with `json.loads` and picking out the property in question. Don't rely on `unicode-escape` if the input is actually JSON: Python and JavaScript string literals are similar but **not** the same; you'll get the wrong results for characters outside the Basic Multilingual Plane. — bobince, Jul 03 '11 at 08:20
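To illustrate the point about JSON, here is a minimal sketch in Python 3 syntax. The payload and its `html` key are made up for the example and do not come from the question. It compares `json.loads` with a naive `unicode_escape` decode on a character outside the Basic Multilingual Plane, which JSON writes as a surrogate pair.

```python
import json

# Hypothetical JSON payload; the "html" key is an assumption for illustration.
# \ud83d\ude00 is the JSON surrogate-pair spelling of U+1F600.
raw = '{"html": "\\u003ctable>\\u003ctr>\\u003ctd>\\ud83d\\ude00"}'

# json.loads understands JSON escapes and joins the surrogate pair
# into a single code point.
html = json.loads(raw)["html"]

# Treating the same escapes as a Python string literal leaves two
# separate surrogate code points instead of one character.
naive = b"\\ud83d\\ude00".decode("unicode_escape")

print(len(html))   # 16: the 15 characters of '<table><tr><td>' plus one emoji
print(len(naive))  # 2: two lone surrogates
```

If the data really is JSON, decoding the whole document and then reading the property avoids this mismatch.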

score 4 · Accepted Answer · edited Jul 01 '11 at 17:29


Use

s.decode("unicode-escape")

to decode the HTML data first (no idea where you are getting these escaped characters from).
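The call above is Python 2, where `s` is a byte string. A rough Python 3 equivalent might look like the sketch below; the helper name `unescape_unicode` is made up for illustration, and it assumes the input contains only ASCII characters plus literal backslash escapes.

```python
def unescape_unicode(text):
    """Turn literal \\uXXXX sequences in an ASCII-only str into real characters."""
    # In Python 3 the unicode_escape codec decodes bytes, so encode first.
    # encode("ascii") raises UnicodeEncodeError if the input is not pure ASCII.
    return text.encode("ascii").decode("unicode_escape")


# A raw string keeps the backslashes literal, mimicking the question's input.
escaped = r"\u003ctable>\u003ctr>"
print(unescape_unicode(escaped))  # <table><tr>
```

The result is an ordinary Unicode string that can be passed straight to BeautifulSoup.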

edited Jul 01 '11 at 17:29

mechanical_meat


answered Jul 01 '11 at 17:11

score 0 · Answer 2 · answered Jul 01 '11 at 17:00


I have no clue what you're talking about. I suspect that I'm not the only one.

>>> s = BeautifulSoup.BeautifulSoup(u'<html><body>\u003ctable>\u003ctr</body></html>')
>>> s
<html><body><table><tr></tr></table></body></html>

answered Jul 01 '11 at 17:00

Ignacio Vazquez-Abrams

