
The following byte string and unicode string can exist on their own if defined explicitly:

>>> value_str = 'Andr\xc3\xa9'
>>> value_uni = u'Andr\xc3\xa9'

If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?

EDIT:

I did the following:

>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'

which fixes my issue. Can someone explain to me what exactly is happening?

Thierry Lam
  • This is the THIRD question that you've asked in less than a day, all based on the same misunderstanding. `u'Andr\xc3\xa9'` is a nonsense obtained by a double encoding with utf8 and latin1. Just don't do that! – John Machin May 06 '10 at 22:32
  • That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one? – Thierry Lam May 07 '10 at 03:45

7 Answers


You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9', which is equivalent to 'André'.

But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'

Then decode it correctly:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'    

Now it is in the correct format.

However, instead of doing this, you should if possible work out why the data has been incorrectly encoded in the first place, and fix the problem there.
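
For what it's worth, here is one common way such a string comes about (a guess at the likely cause, not a claim about where the OP's data actually came from): correct UTF-8 bytes get decoded with latin-1 somewhere along the way:

>>> u'Andr\xe9'.encode('utf8')        # correct UTF-8 bytes
'Andr\xc3\xa9'
>>> 'Andr\xc3\xa9'.decode('latin1')   # decoded with the wrong codec
u'Andr\xc3\xa9'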

Mark Byers

If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding. The correct encoding is UTF-8. To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered: the first 256 code points of Unicode map 1:1 to the ISO-8859-1 (alias latin1) encoding. So:

>>> u'Andr\xc3\xa9'.encode('latin1')
'Andr\xc3\xa9'

Now it is a byte string that can be decoded correctly with utf8:

>>> 'Andr\xc3\xa9'.decode('utf8')
u'Andr\xe9'
>>> print 'Andr\xc3\xa9'.decode('utf8')
André

In one step:

>>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
André
Mark Tolonen

You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""

In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:

Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.

Initial state: you have a unicode object that you have named u1. It contains e-acute:

>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'

You encode u1 as UTF-8 and name the result s:

>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'

You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.

>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>

Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 of its codepoints map 1 to 1 with the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
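
To see that asymmetry concretely (the exact wording of the error message may vary slightly between Python versions):

>>> '\xe9'.decode('utf8')       # a lone \xe9 is not valid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('latin1')     # latin1 accepts any byte at all
u'\xe9'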

Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.

John Machin

value_uni.encode('utf8') or whatever encoding you need.

See http://docs.python.org/library/stdtypes.html#str.encode
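
Note, for the record: applied to the OP's value_uni this encodes the mojibake itself, producing a doubly encoded byte string -- exactly what the comments below report:

>>> u'Andr\xc3\xa9'.encode('utf8')
'Andr\xc3\x83\xc2\xa9'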

UncleZeiv
  • Just to add: the two may look the same, but the Unicode literal is made of code points that correspond to symbols, whereas a normal string is meaningless unless you know the encoding. – dhill May 06 '10 at 17:35
  • I get 'Andr\xc3\x83\xc2\xa9', isn't this different than 'Andr\xc3\xa9'? – Thierry Lam May 06 '10 at 17:35
  • @Thierry: That's what you get if you screw up and put UTF-8 in a unicode. – Ignacio Vazquez-Abrams May 06 '10 at 17:40
  • Yes, and this is predictable. I think there is no encoding that will convert Unicode code points in range(128,256) to respective bytes. Prove me wrong. – dhill May 06 '10 at 17:41
  • Converting to utf-8 will blow the \xc3 into two bytes! And converting to ascii won't work because \xc3 is not in the ascii range. – I. J. Kennedy May 06 '10 at 18:03
  • @dhill: By design, latin1 aka ISO-8859-1 does exactly what you are talking about. The first 256 codepoints of Unicode are deliberately the same as latin1. Do this: `assert all(ord(chr(x).decode('latin1')) == x for x in range(256)); assert all(ord(unichr(x).encode('latin1')) == x for x in range(256))` – John Machin May 06 '10 at 22:17
  • @John Machin: True, but I meant Unicode encoding and I haven't put an adjective here. I reasoned that there must be at least one special character to build code points larger than a byte. – dhill May 07 '10 at 07:11
  • @dhill: What is a "Unicode encoding"? What do you mean by "special character"? What is a "codepoint larger than a byte"? – John Machin May 07 '10 at 12:54

The OP is not converting to ascii or utf-8, which is why the suggested encode methods won't work. Try this:

v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)), v))

The chr(ord(x)) business gets the numeric value of each unicode character (which had better fit in one byte for your application), and the ''.join call is an idiom that joins the resulting list of one-character strings back into an ordinary string. No doubt there is a more elegant way.
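
For what it's worth, the more elegant way is probably the latin-1 trick from the other answers: latin-1 maps code points 0-255 straight to bytes, so it produces the same byte string as the chr(ord(x)) loop:

>>> v.encode('latin-1') == s
True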

I. J. Kennedy

Simplified explanation. The str type can hold only characters in the range 0-255. If you want to store unicode (which can contain characters from a much wider range) in a str, you first have to encode the unicode to a format suitable for str, for example UTF-8.

To do this, call the encode method on your unicode object and pass the desired encoding as an argument, for example this_is_str = value_uni.encode('utf-8').
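
For example (assuming value_uni holds the correctly decoded u'Andr\xe9' rather than the OP's mojibake value):

>>> value_uni = u'Andr\xe9'
>>> this_is_str = value_uni.encode('utf-8')
>>> this_is_str
'Andr\xc3\xa9'
>>> this_is_str.decode('utf-8') == value_uni
True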

You can read a longer and more in-depth (and language-agnostic) article on Unicode handling here: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Another excellent article (this time Python-specific): Unicode HOWTO

Bartosz

It seems like

str(value_uni)

should work... at least, it did when I tried it.

EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1). So for a platform-independent version of this, try

value_uni.encode('latin1')
David Z
  • I tried that but I get UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128). Which Python version are you using and on which OS? – Thierry Lam May 06 '10 at 17:35
  • Python 2.6.4 on Linux... although now that I think about it, it's possible my system's default encoding is set differently from yours. I'm not entirely sure what my default encoding is, though. – David Z May 06 '10 at 18:14
  • OK, got it, try the new method. – David Z May 06 '10 at 18:18
  • How do you check what your system default encoding is? – Thierry Lam May 06 '10 at 18:22
  • @Thierry Lam, `import sys; sys.getdefaultencoding()` – tgray May 06 '10 at 19:22
  • Not to be pushy, but it would be nice to lose the downvote since I've edited my answer to include the correct solution... – David Z May 07 '10 at 07:41