
I am reading a JSON file in Python which has lots of fields and values (~8000 records). Environment: Windows 10, Python 3.6.4. Code:

import json
json_data = json.load(open('json_list.json'))
print (json_data)

With this I get an error. Below is the stack trace:

  json_data = json.load(open('json_list.json'))
  File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>

I have also tried:

import json
with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
    print (json_data)

With this, my program runs for a long time and then hangs with no output.

I have searched almost all topics related to this and could not find a solution.

Note: The JSON data is valid; when I view it in Postman or any other REST client, it doesn't report any anomalies.

Any help with this, or an alternative way to load my JSON data (for example by converting it to a string and then back to JSON), would be greatly appreciated.

Here is what the file looks like around the reported error:

>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
 b'":"2017-04-10","storage_size_gb":"84.747')
  • Something is wrong if you are somehow ending up decoding cp1252. JSON is specifically UTF-8. Troubleshooting is hard as the Python trace doesn't show you the problematic data -- if you can use `try`/`except` you can at least print the problematic input as a first step towards debugging this, but with a large input, just waiting for it to repro is slow and painful. – tripleee Jan 15 '18 at 08:56
  • Thanks for the response. Is there any way I can change the data file to some other extension and then read it and convert it back to JSON? –  Jan 15 '18 at 09:11
  • Also, to add to that: how can I use try/except? Do you want me to check with other encoding formats? If you could help it would be great. –  Jan 15 '18 at 09:13
  • Python doesn't care what the file name is. If you don't try to decode it as JSON, you can do something else with it first, or instead. – tripleee Jan 15 '18 at 09:30
  • Okay, so what can be the solution for this? –  Jan 15 '18 at 09:32
  • Adding information about decoding errors is not entirely straightforward. In the `except UnicodeDecodeError as err:` block, if you can extract the number 7977319 from the string representation of the error message, you can then `raise ValueError('error around {0}, error message {1}'.format(repr(input[7977310:7977328]), err))` to see the raw string around where the decoding failed (a sketch of this approach follows these comments). – tripleee Jan 15 '18 at 09:32
  • It raises another exception now. –  Jan 15 '18 at 09:39
  • See the stack trace in the edit attached to the question. –  Jan 15 '18 at 09:44
  • Oh you need to do a bit more than that. The `input` variable obviously isn't correct, but you need to refactor the code around the exception handling so that the input you read is in a variable (probably use a better name than `input`) which you can then extract the problematic snippet out of. – tripleee Jan 15 '18 at 09:48
  • See e.g. https://stackoverflow.com/questions/46180610/python-3-unicodedecodeerror-how-do-i-debug-unicodedecodeerror for a bit of an elaboration. You don't really need to try to decode JSON, just read the file or stream at the problematic offset and determine how it's broken. – tripleee Jan 15 '18 at 09:50
  • Do you mean that for `input` I need to give the JSON file name? –  Jan 15 '18 at 09:50
  • No, it should contain the *contents of* the JSON file. – tripleee Jan 15 '18 at 09:51
  • Okay, cool, thanks :-) I will go through it. Just a question: I don't need all of the parts in that JSON, only a few, so can I skip the problematic parts by directly selecting the required "key: values"? –  Jan 15 '18 at 09:52
  • If it's all a monolithic single line then it's hard to avoid reading the whole thing into memory. If it's split over multiple lines, you can try to approach it with some sort of incremental parsing, but that's harder. (Well, you can read a buffer of the long single line and try to chunk it in some way for a similar effect.) – tripleee Jan 15 '18 at 09:53
  • While debugging I saw that in some places in my JSON file the values look like this: "license":"". Is this causing the issue? How can I skip these values, as these fields are not required? –  Jan 15 '18 at 10:51
  • No, that's not a problem, it's a dictionary key whose value is empty. Encoding errors will not even be viewable as proper UTF-8; that's the problem here. If you manage to read the file as bytes you can scan it for invalid UTF-8 sequences to figure out what's wrong. – tripleee Jan 15 '18 at 15:39
  • Try this: https://gist.github.com/tripleee/b368f6a37492ab0ab3a896f6f1a94a92 – tripleee Jan 16 '18 at 09:38
  • Thank you very much for the help :-) –  Jan 16 '18 at 13:43
  • Did that actually allow you to solve your problem? I'm thinking maybe I should post it as an answer to the question I linked to above, and suggest this as a duplicate of that question. – tripleee Jan 16 '18 at 13:46
  • I tried the above one; my program just hangs for a long time and unfortunately doesn't give any output. Not sure what can be done. –  Jan 16 '18 at 13:52
  • If you can extract the data around offset 7977319 then it should be quicker to see what's wrong. I don't know of a good way to do that on Windows, unfortunately; with Python you should be able to open the file in binary mode and `seek()` to a position a few hundred bytes before the troublesome spot, read a chunk, and write it out to a new file you opened for writing in binary mode. – tripleee Jan 16 '18 at 13:54
  • I have done that and pasted it above, but I don't understand what it means. –  Jan 16 '18 at 14:03
  • Can I open the file in binary mode and then convert it into a proper encoding for JSON? –  Jan 16 '18 at 14:09
  • You need to seek to a bit before the error. We already know that the \x81 is problematic but we need to see a few bytes before it to make proper sense of it. Looks like the bytes later on in the file are proper UTF-8 and if I'm allowed to guess, this character should be ó or í (but I don't really speak Spanish... er, Portuguese?). – tripleee Jan 16 '18 at 14:17
  • \xc2 \x89 looks wrong too, but for other reasons. Can you perhaps just edit out this one problematic snippet from the JSON? (Can't tell with just 100 bytes if it's a big or small thing.) – tripleee Jan 16 '18 at 14:20
  • For what it's worth, \xc3 \x83 is [U+00C3](http://www.fileformat.info/info/unicode/char/00c3/index.htm) and \xc2 \x89 is [U+0089](https://www.fileformat.info/info/unicode/char/0089/index.htm) which is a control character. I guess the string should be `INGLÉS` or something like that, but it's hard or impossible without further knowledge about whatever generated this data to make informed guesses. – tripleee Jan 16 '18 at 14:23
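
For reference, here is a rough sketch of the debugging approach described in the comments above. It is only an illustration, assuming the file fits comfortably in memory and that `json_list.json` is the file from the question; on a decoding failure it shows the bytes around the offending offset instead of just the offset number.

# Rough sketch: read the raw bytes, attempt the decode ourselves, and on a
# UnicodeDecodeError show the surrounding bytes (err.start and err.end are
# the offsets the codec reports).
with open('json_list.json', 'rb') as fd:
    raw = fd.read()

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError as err:
    context = raw[max(err.start - 20, 0):err.end + 20]
    raise ValueError('error around {0!r}, error message {1}'.format(context, err))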

1 Answer


The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.

Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.

The string INGL\xc3\x83\xc2\x89S suggests this analysis: if you can guess that it is supposed to say Inglés in upper case, you realize that the UTF-8 encoding for É is \xC3 \x89, and you can then examine which characters those two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
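
A quick round trip in an interactive interpreter reproduces exactly the four bytes seen in the file, which is what makes the double-encoding diagnosis convincing:

>>> 'É'.encode('utf-8')
b'\xc3\x89'
>>> 'É'.encode('utf-8').decode('latin-1').encode('utf-8')
b'\xc3\x83\xc2\x89'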

Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.

Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding. However, an error this far into the file makes me suspect it's a local problem affecting just one or a few records; perhaps they were merged from multiple input files, only one of which had this error. Fixing that requires a fair bit of detective work and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply remove the erroneous records by hand.
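
If the whole file does turn out to be consistently double-encoded, a minimal sketch of undoing the extra round could look like this (the file name is the one from the question; isolated corrupt bytes such as the stray \x81 may still make it fail, in which case manual editing is unavoidable):

# Minimal sketch: undo one superfluous Latin-1 -> UTF-8 round of encoding.
# If the encode step raises UnicodeEncodeError, try 'cp1252' instead of
# 'latin-1'; if it still fails, the damage is not uniform and needs manual work.
import json

with open('json_list.json', encoding='utf-8') as fd:
    mojibake = fd.read()          # strips the outer, incorrect UTF-8 layer

repaired = mojibake.encode('latin-1').decode('utf-8')   # undo the extra round
json_data = json.loads(repaired)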

tripleee