I have an encoding issue with strings I get from an external source. The source sends me the strings already encoded, and I can only decode them when they appear as literals in the script's code. I've looked at several threads here and at some recommended tutorials (such as this one) but came up empty.

For example, if I run this:

python -c 'print "gro\303\237e"'

I get:

große

Which is the correct result.

But if I use it in a script, such as:

import sys
print sys.argv[1]

and call it like test.py "gro\303\237e", I get:

gro\303\237e

I intend to write the correct string to syslog, but I can't seem to get this to work.

Some data on my system:

- Python 2.7.10
- CentOS Linux
- LANG=en_US.UTF-8
- LC_CTYPE=UTF-8

I'd appreciate any help. Please let me know if you need more information. Thanks!

Anil_M
n3g4s
  • Just call your script with `test.py "große"`. – syntonym Apr 12 '16 at 11:10
  • I would, but I don't control the input string. It arrives already encoded. Thanks. – n3g4s Apr 12 '16 at 11:16
  • `\xxx` in a *string literal* is being interpreted as an escape sequence, but **only** in a string literal. – More than that though, `\303\237` as escape sequence for "ß" is rather... unusual. Seems like the encoding of that string went wrong. You can get the right result if you *decode* it (in)correctly in the same way, but what kind of escaping is that supposed to be and can you correct it at the source? – deceze Apr 12 '16 at 11:18
  • @deceze Unfortunately I don't control the source. And I agree with you: the encoding is strange and I can't really map it, but Python seems to understand it. – n3g4s Apr 12 '16 at 11:20
  • You can unescape (rather than decode, although it's kinda the same thing) via [this SO answer](http://stackoverflow.com/questions/1885181/how-do-i-un-escape-a-backslash-escaped-string-in-python) – syntonym Apr 12 '16 at 11:21
  • Looks like the escape sequences represent raw bytes, and interpreting those bytes as UTF-8 yields the desired result. – deceze Apr 12 '16 at 11:22
  • @syntonym that works, actually! Want to put it as an answer, so I can mark it as right? – n3g4s Apr 12 '16 at 13:19
  • @n3g4s Are you sure that deceze's comment does not apply to your case? Can you show us some of the original data you get? Also, how do you feed the external data to your program? – syntonym Apr 12 '16 at 13:31
  • This is data that I get. The example is real data (I truncated the whole sentence which was "Eine gro\303\237e Umarmung"). I'm feeding it via a direct call to python, exactly the way I've written here. – n3g4s Apr 12 '16 at 13:49
  • Do you type that manually? – syntonym Apr 12 '16 at 14:06
  • `\303\237` is an octal escape code, equivalent to hexadecimal `\xc3\x9f`, which is UTF-8 for Unicode `u'\xdf'`, which is `ß`. So something like `.decode('string-escape').decode('utf8')` should work. It's double-encoded. – Mark Tolonen Apr 12 '16 at 14:20
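
The distinction drawn in the comments above, between escapes inside a string literal and the same characters arriving as data, can be checked directly. This is only an illustrative sketch: it uses bytes literals so that it behaves the same on Python 2.7 and Python 3, and the variable names are made up for the example.

```python
# Inside source code, the parser turns each octal escape into one byte.
literal = b"gro\303\237e"
assert len(literal) == 6
assert literal == b"gro\xc3\x9fe"  # \303 == 0xC3, \237 == 0x9F

# A command-line argument holds the raw characters instead: a backslash
# followed by three digits. Nothing has interpreted them yet.
from_argv = b"gro\\303\\237e"
assert len(from_argv) == 12

# The two bytes 0xC3 0x9F are the UTF-8 encoding of U+00DF (ß).
decoded = literal.decode("utf-8")
assert decoded == u"gro\xdfe"
```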

2 Answers

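This should work:

import ast
import sys

# Wrap the argument in a bytes literal so literal_eval interprets the
# escapes, then decode the resulting bytes as UTF-8.
print ast.literal_eval('b"%s"' % sys.argv[1]).decode("utf-8")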

But please read about `literal_eval` first to make sure it suits your needs. I think it is safe to use here, but you should check for yourself.

Ronen Ness
  • thanks for the tip, but I get a different error with that code: UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128) – n3g4s Apr 12 '16 at 12:23
  • that's weird, I tried 'python test.py "gro\303\237e"' and it works on windows and linux (ubuntu) with python 2.7.6. what's the exact input that doesn't work for you? – Ronen Ness Apr 12 '16 at 12:38
  • That's the input: "gro\303\237e" – n3g4s Apr 12 '16 at 12:42
  • yeah but how did you call the python script? same way I did? you can check it out online here btw: https://repl.it/CEg1/0 – Ronen Ness Apr 12 '16 at 12:43
  • I call it like I posted in the question: test.py "gro\303\237e". The link you sent is for a string included in the code, which is when it works, like I said in my question. – n3g4s Apr 12 '16 at 13:15
  • ahh I see what happened here. try to add 'b' before the quotes in the literal_eval (I updated the answer you can just copy it again) – Ronen Ness Apr 12 '16 at 13:47
  • The UnicodeEncodeError could be your terminal settings. – cdarke Apr 12 '16 at 14:10
  • @Ness I tried it again, with the change, but I got the same error. Thanks anyway! – n3g4s Apr 12 '16 at 18:12
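
One possible explanation for the `UnicodeEncodeError` reported above, in line with cdarke's comment: in Python 2, `print` encodes a unicode object with the output stream's encoding, and falls back to ASCII when the stream has none. Whether that is what happened on the asker's machine is an assumption. A minimal sketch that runs on Python 2.7 and 3 and shows the failing step next to an explicit encode:

```python
text = u"gro\xdfe"  # the decoded string, containing U+00DF (ß)

# This is what an implicit ASCII encode amounts to.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly yields UTF-8 bytes that do not depend on the
# terminal or locale settings.
utf8_bytes = text.encode("utf-8")
```

If this is the cause, encoding to UTF-8 explicitly before writing the string out may avoid the error.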

If you really have the characters gro\303\237e, which is different from the string literal "gro\303\237e" (the first is the characters g r o \ 3 0 3 \ 2 3 7 e, the second is the characters g r o ß e), you can use decode("string_escape") as described in this SO answer.

Note that this is probably an encoding error by whoever produced the data, so it may contain other errors that you cannot fix with this method.
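
The unescape-then-decode approach can be sketched as a small helper. This is not the code from the linked answer: the function name `unescape_utf8` is hypothetical, and the sketch assumes the input is pure ASCII and contains only escapes with values up to `\377`. It uses the `unicode_escape` codec with a `latin-1` round trip so that it runs on both Python 2.7 and Python 3. On Python 2.7, the `string_escape` codec performs the first two steps in one call.

```python
def unescape_utf8(escaped):
    """Turn text with literal backslash escapes into decoded Unicode."""
    # Step 1: interpret the escapes. unicode_escape maps each \ooo
    # to the code point with that numeric value (0-255).
    as_code_points = escaped.encode("ascii").decode("unicode_escape")
    # Step 2: latin-1 maps code points 0-255 back to identical bytes.
    raw_bytes = as_code_points.encode("latin-1")
    # Step 3: those bytes are the actual UTF-8 payload.
    return raw_bytes.decode("utf-8")


# Double backslashes here stand for the single literal backslashes
# that arrive in the external data.
result = unescape_utf8("Eine gro\\303\\237e Umarmung")
```

In a script like the one in the question, the helper would be called on `sys.argv[1]`.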

Community
syntonym
  • Thanks for this, @syntonym. This helped a lot and turned out to be the only answer that worked. Thanks as well to all others who chipped in. Your contribution was important! – n3g4s Apr 12 '16 at 18:10