I have a JSON file that I am trying to read.

The JSON file includes the following text, which I am trying to decode with the code below.

Silikon-Ersatzpl\u00ef\u00bf\u00bdttchen Regenlichtsensor

import json

with open("file_name", encoding="utf-8") as file:
    pdf_labels = json.loads(file.read())

When I try to load it with the json module and specify UTF-8 encoding, I get some weird results.

"\u00ef\u00bf\u00bd" becomes "�" instead of the desired "ä"

The desired output should look like the following.

Silikon-Ersatzplättchen Regenlichtsensor
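
For reference, here is a minimal, runnable reproduction using the problematic text as a literal JSON string rather than my actual file:

import json

# Sketch: parsing the escaped sequence directly gives the three characters
# U+00EF, U+00BF, U+00BD instead of the expected "ä".
raw = r'"Silikon-Ersatzpl\u00ef\u00bf\u00bdttchen Regenlichtsensor"'
label = json.loads(raw)
print(label)         # Silikon-Ersatzplï¿½ttchen Regenlichtsensor
print(label[16:19])  # ï¿½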

Please don't be harsh, this is my first question :)

    The file content is corrupted, probably due to having been incorrectly decoded at some point. How was the file created? – snakecharmerb Apr 04 '21 at 17:21
  • The UTF-8 sequence for ä is `\u00e4`, so your data certainly is not UTF-8. According to https://unicode-table.com/en/00E4/ it is also not UTF-16 or UTF-32. This means that your input file was either saved in a different encoding, or something went wrong and the characters were replaced with some garbage. – jurez Apr 04 '21 at 17:23
  • The first paragraph of this answer is somewhat relevant: https://stackoverflow.com/questions/6366912/reading-file-from-windows-and-linux-yields-different-results-character-encoding/6367675#6367675 . Long story short: I agree with previous commenters; the data in the file is broken. – Ture Pålsson Apr 04 '21 at 17:25
  • @snakecharmerb I need to check with a colleague, who created them :) Thank you for the answer. – Stephan.tplnk Apr 04 '21 at 17:30
  • @jurez Thanks for the link and the insight. – Stephan.tplnk Apr 04 '21 at 17:30
  • @TurePålsson Thanks for the link. It is what I assumed but hoped not to be true haha – Stephan.tplnk Apr 04 '21 at 17:31
  • Typically this can be fixed by something like `fixed = broken_string.encode('latin-1').decode('utf-8')`, where `latin-1` could also be some other 8-bit encoding such as cp1252. But trying latin-1 or cp1250-cp1252 all result in a still-broken string, so understanding how it was created is probably the best way to get the solution. – snakecharmerb Apr 04 '21 at 17:31
  • @snakecharmerb I tried various solutions going that direction, but I could not make it work unfortunately – Stephan.tplnk Apr 04 '21 at 17:38
  • The file content is corrupted (probably via some previous incorrect manipulation) as `'\u00EF\u00BF\u00BD'.encode('latin1') == '\uFFFD'.encode('utf-8')` and `'\uFFFD'` is *Replacement Character*. – JosefZ Apr 04 '21 at 18:13
  • @JosefZ Thank you. It was indeed already an issue with the encoding and not with my decoding approach. – Stephan.tplnk Apr 09 '21 at 08:06
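
A runnable sketch of the check snakecharmerb and JosefZ describe in the comments above: the three escaped code points, re-encoded as Latin-1 bytes, are exactly the UTF-8 encoding of U+FFFD, the replacement character.

# Sketch of the check from the comments: the bytes already in the file are
# the UTF-8 encoding of U+FFFD, so the corruption happened before the file
# was written.
broken = "\u00ef\u00bf\u00bd"
print(broken.encode("latin-1"))                  # b'\xef\xbf\xbd'
print("\ufffd".encode("utf-8"))                  # b'\xef\xbf\xbd'
print(broken.encode("latin-1").decode("utf-8"))  # '�' (U+FFFD)

Since U+FFFD replaced the original character before the file was saved, the "ä" cannot be recovered from the file itself; it has to be regenerated from the original source.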

0 Answers