2

Possible Duplicate:
Unescaping Characters in a String with Python

I have a string of Unicode HTML in Python which begins with `\u003ctable>\u003ctr`. I need to convert this to ASCII so that I can parse it with BeautifulSoup. However, Python's `encode` and `decode` functions seem to have no effect; I get the original string back no matter what I try. I'm new to Python and to Unicode in general, so any help would be much appreciated.

mdk25
  • BeautifulSoup can handle Unicode. In fact, it goes to great lengths to make everything unicode, with a class called "UnicodeDammit". – Thomas K Jul 01 '11 at 17:02
  • Oh, wait, I think I see what you mean. You've somehow got a byte string including those characters? Try `s.decode("unicode-escape")`. Or if it's in your code, write it as `u"\u003ctable>\u003ctr"`. – Thomas K Jul 01 '11 at 17:04
  • Yeah, you guys and Sentinel below are all correct. Thanks. – mdk25 Jul 01 '11 at 17:24
  • The other likely source of a `\u003c` escape is JSON. If you are receiving JSON-encoded input you should be decoding the entire thing with `json.loads` and picking out the property in question. Don't rely on `unicode-escape` if the input is actually JSON: Python and JavaScript string literals are similar but **not** the same; you'll get the wrong results for characters outside the Basic Multilingual Plane. – bobince Jul 03 '11 at 08:20
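To illustrate the JSON point in the last comment, here is a minimal sketch, assuming Python 3 and a made-up payload (the `html` key and its value are hypothetical, not taken from the question). It compares `json.loads` with a naive `unicode_escape` decode on an escaped surrogate pair, which is how JSON writes a character outside the Basic Multilingual Plane.

```python
import json

# Hypothetical JSON payload; the "html" key and its contents are invented for
# illustration. \ud83d\ude00 is a surrogate pair encoding one non-BMP character.
raw = '{"html": "\\u003ctable>\\u003ctr>\\u003ctd>\\ud83d\\ude00"}'

# json.loads understands JSON escapes and joins the surrogate pair
# into a single code point (U+1F600).
html = json.loads(raw)["html"]
print(html.startswith("<table><tr><td>"))  # True
print(len(html))                           # 16

# Treating the same escapes as a Python string literal leaves two
# separate surrogate code points instead of one character.
fragment = "\\u003ctable>\\u003ctr>\\u003ctd>\\ud83d\\ude00"
naive = fragment.encode("ascii").decode("unicode_escape")
print(len(naive))                          # 17
print(naive == html)                       # False
```

The two results agree for plain BMP escapes such as `\u003c`, which is why the difference is easy to miss until an emoji or similar character shows up.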

2 Answers

4

Use

s.decode("unicode-escape")

to decode the HTML data first (no idea where you got this character crap from).

mechanical_meat
0

I have no clue what you're talking about. I suspect that I'm not the only one.

>>> import BeautifulSoup
>>> s = BeautifulSoup.BeautifulSoup(u'<html><body>\u003ctable>\u003ctr</body></html>')
>>> s
<html><body><table><tr></tr></table></body></html>
Ignacio Vazquez-Abrams