I am using an API in Python v2.7 to obtain a string, the content of which is unknown. The content can be in English, German or French. The variable name assigned to the returned string is 'category'. An example of a returned value for the variable category is:-

"temp\\u00eate de poussi\\u00e8res"

I have tried category.decode('utf-8') to decode the string into, in the above case, French, but unfortunately it still returns the same value, with an additional unicode 'u' at the beginning when I print the result of category.decode('utf-8').

u'"temp\\u00eate de poussi\\u00e8res"'

I also tried category.encode('utf-8'), but it returns the same value (minus the 'u' that precedes the string):-

'"temp\\u00eate de poussi\\u00e8res"'

Any suggestions?

tripleee
thefragileomen

2 Answers

2

I think you have literal backslashes in your string, not Unicode characters.

That is, \u00ea is the Unicode escape sequence for ê, but \\u00ea is actually a backslash (escaped) followed by the letter u, two zeros and two letters.

Similarly for the quotation marks: your first and last characters are literal double quotes ".

You can convert each backslash-plus-code-point sequence into its equivalent character with:

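# x holds literal backslash-u sequences, exactly as the API returns them
x = '"temp\\u00eate de poussi\\u00e8res"'
# unicode_escape turns each \uXXXX sequence into the character it names
d = x.decode("unicode_escape")
print d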

The output is:

"tempête de poussières"

Note that to see the proper international characters you have to use print. If instead you just write d in the interactive Python shell you get:

 u'"temp\xeate de poussi\xe8res"'

where \xea is equivalent to \u00ea, which is the escape sequence for ê.

Removing the quotes, if required, is left as an exercise to the reader ;-).
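For what it's worth, one possible way to do the quote removal is sketched below. It assumes the value always arrives wrapped in a single pair of double quotes; the names raw, decoded and text are only illustrative. The b prefix is used so the same lines run on Python 2.7 and Python 3.

```python
# Assumed sample value: a byte string holding literal backslash-u escapes,
# wrapped in literal double quotes (as in the question).
raw = b'"temp\\u00eate de poussi\\u00e8res"'

# Turn each \uXXXX sequence into the character it names.
decoded = raw.decode("unicode_escape")

# Drop leading and trailing double quotes. Note that strip() removes
# every quote at either end, not just one pair.
text = decoded.strip(u'"')
```

If the string can legitimately start or end with a quote character of its own, slicing with decoded[1:-1] after checking both ends would be a safer choice than strip().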

rodrigo
  • Thanks @rodrigo. Can you explain further what you mean at the end? I made the changes as you suggested but I get the below error returned. This is returned as a response to a print command:- UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 5: ordinal not in range(128) – thefragileomen Dec 04 '18 at 15:23
  • @thefragileomen: Can you specify which sentence you want me to explain? The one about the quotes? About your new error: the `ascii` codec can only encode ASCII characters, and `ê` is not an ASCII character. Why a `print` implies ASCII encoding is another matter; usually this happens because you are redirecting the output of your program and `python` assumes that all files are ASCII unless told otherwise. Please see this [answer](https://stackoverflow.com/a/4546129/865874) for all the details. – rodrigo Dec 04 '18 at 15:31
  • Your `print` wants to convert the string to ASCII, probably because you haven't set it up to use a sane (ideally Unicode-compatible) system encoding. Lots of these issues go away if you simply switch to Python 3 and a properly Unicode-compatible locale. – tripleee Dec 04 '18 at 15:33
  • Thanks @rodrigo. If you can explain more about the new error. I thought this was related sorry and not a new matter. – thefragileomen Dec 04 '18 at 15:35
  • Unfortunately @tripleee, I am unable to move to Python 3 due to restrictions in the environment I am deploying this code (restrictions that are out of my control) – thefragileomen Dec 04 '18 at 15:36
  • The link in the earlier comment to an answer to a related question is helpful even if your immediate problem isn't with printing to a file. – tripleee Dec 04 '18 at 15:45
  • About the new error, stdout is actually a file, so you write bytes into it, not characters. When your program writes to stdout, Python will convert unicode strings into bytes using the appropriate encoding. In any sane console that will be UTF-8 (sane being anything of this century and not Windows console). But if your stdout is redirected, because you do it manually (`program.py > out.txt`) or your program is run as part of a batch process (cgi, scheduled task...) then the encoding of stdout may be undefined and fall back to ASCII. – rodrigo Dec 04 '18 at 15:45
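One way to sidestep the ASCII fallback described in these comments is to encode the text yourself before it reaches the output stream. The sketch below is only an illustration: write_utf8 is a hypothetical helper, and an in-memory io.BytesIO stands in for a redirected stdout so the example behaves the same on Python 2.7 and Python 3.

```python
import io


def write_utf8(stream, text):
    """Encode text as UTF-8 and write the bytes to a byte stream.

    Hypothetical helper, not part of any library.
    """
    stream.write(text.encode("utf-8") + b"\n")


# io.BytesIO stands in for a redirected stdout in this sketch.
buf = io.BytesIO()
write_utf8(buf, u"temp\xeate de poussi\xe8res")
data = buf.getvalue()
```

On Python 2.7, sys.stdout accepts byte strings, so write_utf8(sys.stdout, d) should work there directly. Setting the PYTHONIOENCODING environment variable to utf-8 before launching the script is another option that leaves the code untouched.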
1

It looks like the API uses JSON. You can decode it with the json module:

>>> import json
>>> json.loads('"temp\\u00eate de poussi\\u00e8res"')
u'temp\xeate de poussi\xe8res'
>>> print(json.loads('"temp\\u00eate de poussi\\u00e8res"'))
tempête de poussières
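
If it is not certain that every value from the API is a JSON string, the call can be wrapped defensively. The following is a minimal sketch under that assumption; decode_category is a hypothetical helper name, and returning the input unchanged on failure is just one possible policy.

```python
import json


def decode_category(raw):
    """Return the decoded text if raw is a JSON string literal,
    otherwise return raw unchanged. Hypothetical helper."""
    try:
        value = json.loads(raw)
    except ValueError:
        # Not valid JSON at all, e.g. plain unquoted text.
        return raw
    # json.loads can also yield numbers, lists, dicts and so on;
    # only accept text here.
    if isinstance(value, (type(u""), str)):
        return value
    return raw


category = decode_category('"temp\\u00eate de poussi\\u00e8res"')
```

Unlike the unicode_escape approach, this also strips the surrounding double quotes, since they are part of the JSON string syntax.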
Mark Tolonen