31

When I try to get the content of a tag using "unicode(head.contents[3])", I get output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as a string. How do I do this in Python?

martineau
Vicky

3 Answers

34

Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:

>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'

Another way of achieving this:

>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'

Note the "u" in front of the string, signalling that it is unicode. If you print this, the accented letter is shown properly:

>>> print name.decode('latin-1')
Christensen Sköld

BTW: when necessary, you can use the "encode" method to turn the unicode into e.g. a UTF-8 string:

>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
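
For readers on Python 3, the same decode-then-encode round trip can be sketched as follows. This is only an illustration under assumptions: the input is taken to be Latin-1 encoded bytes, and `latin1_to_utf8` is a hypothetical helper name, not something from the answer above.

```python
# Python 3 sketch; latin1_to_utf8 is a hypothetical helper name.
def latin1_to_utf8(raw: bytes) -> bytes:
    """Decode Latin-1 bytes to str, then re-encode the text as UTF-8."""
    text = raw.decode('latin-1')   # bytes -> str (Unicode text)
    return text.encode('utf-8')    # str -> UTF-8 bytes

raw = b'Christensen Sk\xf6ld'      # assumed input: Latin-1 encoded bytes
text = raw.decode('latin-1')       # the decoded Unicode text
utf8 = latin1_to_utf8(raw)         # the same text as UTF-8 bytes

print(ascii(text))  # escaped view of the decoded text
print(utf8)         # the UTF-8 byte sequence
```

In Python 3 the `unicode` type is simply `str`, so there is no "u" prefix to look for; the distinction is between `bytes` and `str`.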
Mark van Lent
  • Thanks a lot, dude. So if I need to save it to a database, I can decode it and save it to the database, right? – Vicky Jun 14 '09 at 10:10
  • 2
    NO, read Mark's example again. After decoding the data from whatever it is (latin1, cp1252, etc) into unicode, you need to encode your unicode string with an encoding that (1) your database supports and (2) preserves all unicode characters ... typically UTF-8. – John Machin Jun 14 '09 at 22:45
10

Given a byte string with Unicode escapes b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.
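
A short Python 3 sketch of the same idea, for the case where the bytes contain literal backslash escapes. The sample inputs are assumptions chosen for illustration; raw byte literals (`rb'...'`) are used so the backslashes stay in the data.

```python
# Python 3 sketch; the sample inputs are assumptions for illustration.
# Each sample holds a literal backslash escape, not the character itself.
samples = [rb'\N{SNOWMAN}', rb'\u00e9', rb'Sk\xf6ld']

# 'unicode-escape' interprets the escape sequences and returns a str.
decoded = [s.decode('unicode-escape') for s in samples]

for s, d in zip(samples, decoded):
    print(s, '->', ascii(d))
```

One caveat: this codec treats any non-escape bytes as Latin-1, so it is best suited to input that is pure ASCII apart from the escapes.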

joeforker
  • 1
    While not exactly the answer to the question, this is the proper answer when you get strings encoded like '\u00e9' – Tshirtman Nov 27 '19 at 17:23
10

I suspect that it's actually working correctly. By default, the interactive interpreter displays the repr of a string, which escapes non-ASCII characters, since not all terminals support unicode. If you actually print the string, though, it should work. See the following example:

>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
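
The difference between the escaped display and the actual string contents can also be checked in Python 3. This is a minimal sketch: in Python 3, `repr` no longer escapes printable non-ASCII characters, so the built-in `ascii()` is used here to reproduce the escaped view.

```python
# Python 3 sketch: the escape only appears in the displayed form.
s = '\xcfa'            # two characters: U+00CF followed by 'a'

escaped = ascii(s)     # escaped, ASCII-only representation of the string
print(escaped)         # shows the \xcf escape, as the Python 2 repr does
print(len(s))          # the string itself holds just two characters
```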
BJ Homer