I am using Python's urllib.request library to read data from an API.

response = compatible_urllib.urlopen(request).read()

For one of the responses from the API, I get the following string as the value of the response variable:

"title":"Istina o slu\u010daju Harryja Queberta"

Instead of \u010d, the Croatian character č should be displayed, but I can't get it to appear. I tried many solutions found on Stack Overflow, but none of them worked. This is what I have tried so far:

        #res = response.decode("utf-8")
        #res = response.decode("utf-8").replace("\\", "")
        #res = response.decode("cp1250").replace("\\", "")
        #res = response.encode().decode("unicode-escape") # ascii codec can't encode character \u0160
        #res = response.encode("utf-8")
        #res = response.encode('utf-8', 'ignore').decode('utf-8')

What could be a solution for this? (The string is written to the database as \u010d, so it is not just a terminal display issue.)

punky
  • What exactly do you mean by "displayed"? How do you display it? The variable seems to contain the correct Unicode character: [`\u010d` is exactly `č`](https://www.compart.com/en/unicode/U+010D). It's just that your printing method decides to print the string non-verbatim. Wild guess: if you're using Python's json module, try adding `ensure_ascii=False` – yeputons Jul 20 '23 at 20:16
  • When I print the res variable, \u010d is displayed, and when I write that string to the database, it is also displayed as \u010d. – punky Jul 20 '23 at 20:22
  • The response is JSON. Use the `requests` module, and use `response.json()` to decode. – Mark Tolonen Jul 20 '23 at 20:24
  • Try something like `'\\u010d'.encode('raw_unicode_escape').decode('unicode_escape')` which returns `'č'`… – JosefZ Jul 20 '23 at 20:37
  • There is no conversion to do: `'\u010d'` means the same thing that `'č'` does, and if you try `'\u010d'` on a line by itself at the REPL (interpreter prompt), Python will report back `'č'`, at least on properly configured terminals, because that's the standard representation of the string. If you are reading a JSON response from a web API, you should use proper tools to handle JSON, since they will automatically handle escaping within the JSON format. But it is vital that you understand what data you **actually have** versus how it is **displayed**, as well as the ways to check. – Karl Knechtel Jul 20 '23 at 20:37
  • As @KarlKnechtel indicated, for Python `'\u010d'` is the same as `'č'`, and it is best to use a JSON parser to parse JSON files into Python. JSON uses UTF-8, but characters can be escaped so the JSON can be handled by systems that cannot support Unicode. Escaped JSON matches Python Unicode escapes for characters in the BMP, but not for SMP, SIP and TIP characters, where JSON is more akin to UTF-16 and CESU-8. – Andj Jul 21 '23 at 12:03
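
The suggestions in the comments above can be sketched roughly as follows. This is only an illustration under assumptions: the question shows just a fragment of the response, so the surrounding braces and the `raw` variable are stand-ins for whatever `urlopen(request).read()` actually returns. The idea is that the `\u010d` sequence is JSON escaping, so a JSON parser (here the standard `json` module) resolves it, rather than any `encode`/`decode` combination.

```python
import json

# Hypothetical stand-in for the bytes returned by urlopen(request).read().
# The enclosing {...} is an assumption; the question shows only a fragment.
raw = b'{"title":"Istina o slu\\u010daju Harryja Queberta"}'

# Decode the bytes to text, then let the JSON parser resolve \uXXXX escapes.
data = json.loads(raw.decode("utf-8"))
title = data["title"]
print(title)

# Serializing again re-escapes non-ASCII characters by default;
# ensure_ascii=False keeps them as real characters in the output.
escaped = json.dumps(data)
unescaped = json.dumps(data, ensure_ascii=False)
print(escaped)
print(unescaped)

# Outside the BMP, JSON escapes a character as a UTF-16 surrogate pair;
# the parser combines the pair into a single code point.
smiley = json.loads('"\\ud83d\\ude00"')
print(len(smiley))
```

If the value is later written to a database through `json.dumps`, the default `ensure_ascii=True` would reintroduce the `\u010d` form, which may be what is happening here; storing `title` itself, or dumping with `ensure_ascii=False`, avoids that.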

0 Answers