0

I receive a text string from a third-party API with garbled character encodings. When I print that string to the command line, it contains words like

  • ZÃƒÂ¤une instead of Zäune
  • GartenmÃƒÂ¶bel instead of Gartenmöbel

etc.

What can I do to fix the incoming text string with Python 2.7 so that it prints properly to the command line?

Thanks

Jabb

3 Answers

2
In [36]: print('ZÃƒÂ¤une'.decode('utf-8').encode('cp1252').decode('utf-8').encode('latin-1'))
Zäune

In [37]: print('GartenmÃƒÂ¶bel'.decode('utf-8').encode('cp1252').decode('utf-8').encode('latin-1'))
Gartenmöbel

I found this chain of encodings using guess_chain_encodings.py, which performs a brute-force search. The garbled and the desired strings have these byte representations:

In [51]: 'ZÃƒÂ¤une'
Out[51]: 'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'

In [52]: 'Zäune'
Out[52]: 'Z\xc3\xa4une'

Running

guess_chain_encodings.py "'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'" "'Z\xc3\xa4une'"

yielded

'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'.decode('utf_8').encode('cp1254').decode('utf_8_sig').encode('palmos')

A little playing around suggested that cp1254 could be replaced by the (more common?) cp1252, and utf_8_sig could be replaced by utf-8, and the odd palmos could be replaced by latin-1.
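As a rough illustration of the simplified chain, here is a minimal sketch that wraps it in a function and comments each step. The name `repair_bytes` is made up for this example, and the code assumes the input is a UTF-8 byte string damaged in exactly the way shown above. It uses only ASCII escapes so that it runs unchanged on Python 2.7 and Python 3.

```python
def repair_bytes(garbled):
    """Hypothetical helper: undo the mis-decoding on a UTF-8 byte string."""
    # Step 1: the API delivers UTF-8 bytes; decoding gives the garbled text.
    text = garbled.decode('utf-8')
    # Step 2: undo one round of "UTF-8 bytes that were read as cp1252".
    text = text.encode('cp1252').decode('utf-8')
    # Step 3: every remaining code point is below 256, so latin-1 maps each
    # one to the byte of the same value, which leaves plain UTF-8 bytes.
    return text.encode('latin-1')


garbled = b'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'
fixed = repair_bytes(garbled)
print(repr(fixed))
```

If the input matches that assumption, `fixed` should hold the same bytes as `Out[52]` above. Any character that cp1252 or latin-1 cannot represent will raise a `UnicodeEncodeError` at step 2 or 3.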

unutbu
  • Could you add some explanation for this decode/encode chain, please? – Wojciech Walczak Apr 18 '14 at 12:32
  • @WojciechWalczak: `cp1252` is a common encoding used by Windows. `utf-8` is a common encoding used on Unix. So part of the problem appears to be due to a Windows computer outputting a `cp1252` encoded string and a Unix computer interpreting that string as a `utf-8` encoded string, or vice versa. Meanwhile, `latin-1` is an encoding which converts unicode code points to their literal byte values. This suggests that compounding the problem is some machine which is receiving code points and interpreting them as bytes or vice versa. – unutbu Apr 18 '14 at 12:52
  • Hi, the string with the garbled characters is of type "unicode". When I do print garbledString.decode('utf-8').encode('cp1252').decode('utf-8').encode('latin-1'), I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u200e' in position 10739: ordinal not in range(256) – Jabb Apr 18 '14 at 13:00
  • `unicode` is always *encoded*, and `bytes` are always decoded. So if you have `unicode`, then skip the first `.decode()` and instead try: `garbledString.encode('cp1252').decode('utf-8').encode('latin-1')`. This will leave you with a `utf-8` encoded string. If you are on Windows, you will want to decode this to obtain the desired `unicode`: `garbledString.encode('cp1252').decode('utf-8').encode('latin-1').decode('utf-8')`. – unutbu Apr 18 '14 at 13:02
  • Did it. This gives: UnicodeEncodeError: 'charmap' codec can't encode character u'\u200e' in position 16351: character maps to <undefined> – Jabb Apr 18 '14 at 13:07
  • Please post: `print(repr(garbledString))`. – unutbu Apr 18 '14 at 13:08
  • You might also want to try: `garbledString.encode('cp1252').decode('utf-8').encode('cp1252').decode('utf-8')` – unutbu Apr 18 '14 at 13:14
  • Also note that the string you are receiving from the 3rd-party API is bytes, not unicode. So to address this problem at the source, we should be dealing with *that* string, not the unicode `garbledString`. – unutbu Apr 18 '14 at 13:35
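For the `unicode` case discussed in these comments, one possible workaround for the `u'\u200e'` errors is to repair the text chunk by chunk, so that a single character outside cp1252 does not abort the whole conversion. The sketch below is only an assumption about what might help. `undo_once` and `repair_text` are hypothetical names, and the loop repeats the `.encode('cp1252').decode('utf-8')` round trip suggested above until it stops changing anything.

```python
import re


def undo_once(text):
    """Try one cp1252 -> utf-8 round trip; return the input if it fails."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def repair_text(text, max_rounds=3):
    """Hypothetical helper: repair each whitespace-separated chunk on its own."""
    def fix(match):
        chunk = match.group(0)
        for _ in range(max_rounds):
            repaired = undo_once(chunk)
            if repaired == chunk:
                break  # nothing left to undo, or the chunk is not repairable
            chunk = repaired
        return chunk
    return re.sub(r'\S+', fix, text)


# u'Z\xc3\u0192\xc2\xa4une' is the garbled word; u'\u200e' is the character
# that triggered the errors reported above.
sample = u'Z\xc3\u0192\xc2\xa4une \u200e ok'
print(repr(repair_text(sample)))
```

Chunks that cannot be encoded as cp1252 are passed through unchanged. The caveat is that text which legitimately contains sequences such as `Ã¤` would also be "repaired", so this is only reasonable when the input is known to be damaged.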
1

The strings seem to be UTF-8 encoded twice.
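To make that concrete, here is a small sketch that reproduces the damage in the forward direction. It assumes each extra round consists of the UTF-8 bytes being misread as cp1252 and then encoded as UTF-8 again. Under that assumption, two such rounds give exactly the bytes shown in `Out[51]` of the other answer.

```python
original = u'Z\xe4une'  # the intended word, with a-umlaut as U+00E4

# Round 1: correct UTF-8 bytes are misread as cp1252 text.
once = original.encode('utf-8').decode('cp1252')

# Round 2: that text is encoded as UTF-8 and misread as cp1252 again.
twice = once.encode('utf-8').decode('cp1252')

# The bytes the API would then send when it encodes the result as UTF-8.
wire_bytes = twice.encode('utf-8')
print(repr(wire_bytes))
```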

JensG
0

Also check the console encoding: sometimes your strings display fine in the app but fail to print in the console. Here's a very good guide to Unicode in Python and techniques for working with it.
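A minimal sketch of such a check, assuming the text is already a correct `unicode` object: look up the encoding the console reports and encode for it explicitly, replacing anything it cannot display. `encode_for_console` is a hypothetical name, and the UTF-8 fallback is an assumption for the case where no encoding is reported (for example when output is piped).

```python
import sys


def encode_for_console(text, fallback='utf-8'):
    """Hypothetical helper: encode text for the current console."""
    # sys.stdout.encoding can be missing or None, e.g. when output is piped.
    encoding = getattr(sys.stdout, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the console cannot show,
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')


console_bytes = encode_for_console(u'Z\xe4une')
print(repr(console_bytes))
```

On Python 2.7 the returned byte string can be passed to `print` or `sys.stdout.write` directly.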

rook