I have a JSON file that I am trying to read.

The JSON file includes the following text, which I am trying to decode with the code below.

Silikon-Ersatzpl\u00ef\u00bf\u00bdttchen Regenlichtsensor

import json

with open("file_name", encoding="utf-8") as file:
    pdf_labels = json.loads(file.read())

When I try to load it with the json module and specify UTF-8 encoding, I get some weird results.

"\u00ef\u00bf\u00bd" becomes "�" instead of the desired "ä"

The desired output should look like the following.

Silikon-Ersatzplättchen Regenlichtsensor
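
For reference, here is a minimal, runnable reproduction using the problematic text as a literal JSON string rather than my actual file:

import json

# Sketch: parsing the escaped sequence directly gives the three characters
# U+00EF, U+00BF, U+00BD instead of the expected "ä".
raw = r'"Silikon-Ersatzpl\u00ef\u00bf\u00bdttchen Regenlichtsensor"'
label = json.loads(raw)
print(label)         # Silikon-Ersatzplï¿½ttchen Regenlichtsensor
print(label[16:19])  # ï¿½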

Please don't be harsh, this is my first question :)

    The file content is corrupted, probably due to having been incorrectly decoded at some point. How was the file created? – snakecharmerb Apr 04 '21 at 17:21
  • The UTF-8 sequence for ä is `\u00e4`, so your data certainly is not UTF-8. According to https://unicode-table.com/en/00E4/ it is also not UTF-16 or UTF-32. This means that your input file was either saved in a different encoding, or something went wrong and the characters were replaced with some garbage. – jurez Apr 04 '21 at 17:23
  • The first paragraph of this answer is somewhat relevant: https://stackoverflow.com/questions/6366912/reading-file-from-windows-and-linux-yields-different-results-character-encoding/6367675#6367675 . Long story short: I agree with previous commenters; the data in the file is broken. – Ture Pålsson Apr 04 '21 at 17:25
  • @snakecharmerb I need to check with a colleague, who created them :) Thank you for the answer. – Stephan.tplnk Apr 04 '21 at 17:30
  • @jurez Thanks for the link and the insight. – Stephan.tplnk Apr 04 '21 at 17:30
  • @TurePålsson Thanks for the link. It is what I assumed but hoped not to be true haha – Stephan.tplnk Apr 04 '21 at 17:31
  • Typically this can be fixed by something like `fixed = broken_string.encode('latin-1').decode('utf-8')`, where `latin-1` could also be some other 8-bit encoding such as cp1252. But trying latin-1 or cp1250-cp1252 all result in a still-broken string, so understanding how it was created is probably the best way to get the solution. – snakecharmerb Apr 04 '21 at 17:31
  • @snakecharmerb I tried various solutions going that direction, but I could not make it work unfortunately – Stephan.tplnk Apr 04 '21 at 17:38
  • The file content is corrupted (probably via some previous incorrect manipulation) as `'\u00EF\u00BF\u00BD'.encode('latin1') == '\uFFFD'.encode('utf-8')` and `'\uFFFD'` is *Replacement Character*. – JosefZ Apr 04 '21 at 18:13
  • @JosefZ Thank you. It was indeed already an issue with the encoding and not with my decoding approach. – Stephan.tplnk Apr 09 '21 at 08:06
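
A runnable sketch of the check snakecharmerb and JosefZ describe in the comments above: the three escaped code points, re-encoded as Latin-1 bytes, are exactly the UTF-8 encoding of U+FFFD, the replacement character.

# Sketch of the check from the comments: the bytes already in the file are
# the UTF-8 encoding of U+FFFD, so the corruption happened before the file
# was written.
broken = "\u00ef\u00bf\u00bd"
print(broken.encode("latin-1"))                  # b'\xef\xbf\xbd'
print("\ufffd".encode("utf-8"))                  # b'\xef\xbf\xbd'
print(broken.encode("latin-1").decode("utf-8"))  # '�' (U+FFFD)

Since U+FFFD replaced the original character before the file was saved, the "ä" cannot be recovered from the file itself; it has to be regenerated from the original source.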

0 Answers