1

EDIT: I have seen all of the questions on SO about this, and the answers all give me the error I'm asking about here. Please leave this open so I can get some help.

I have a file that I can read very simply with Bash like this: `gzip -d -c my_file.json.gz | jq .` This confirms that it is valid JSON. But when I try to read it using Python like so:

import json
import gzip
with gzip.open('my_file.json.gz') as f:
    data = f.read() # returns a byte string `b'`
json.loads(data)

I get the error:

json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1632)

But I know it is valid JSON from my Bash command. I have been stuck on this seemingly simple problem for a long time now and feel like I have tried everything. Can anyone help? Thank you.

CClarke
  • If your problem is reproducible even after you fix the binary error, please edit this question to (probably fix that red herring and) provide a minimal reproducible example with data which exhibits the problem. With the diagnostics you have provided, we can only conclude that Python's JSON parser is stricter than the one in `jq`. In particular, `jq` tolerates input with multiple JSON structures each on a separate line, but that's not valid JSON. – tripleee Mar 18 '22 at 11:47
  • I updated with another duplicate to explain that part. – tripleee Mar 18 '22 at 11:55

3 Answers

4

As the documentation says, gzip.open() returns a binary file handle by default. Pass mode="rt" to read the data as text:

with gzip.open("my_file.json.gz", mode="rt") as f:
    data = f.read()

... or separately .decode() the binary data (you then obviously have to know or guess its encoding).
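A minimal sketch of the .decode() route, assuming the file is UTF-8 encoded and holds a single JSON document. The helper name `load_gzipped_json` is hypothetical and only for illustration:

```python
import gzip
import json


def load_gzipped_json(path, encoding="utf-8"):
    """Read a gzipped file in binary mode, decode it, and parse one JSON document.

    The default encoding is an assumption; pass the real one if it differs.
    """
    with gzip.open(path, "rb") as f:
        raw = f.read()  # bytes, because the file was opened in binary mode
    return json.loads(raw.decode(encoding))
```

This still raises json.JSONDecodeError if the file contains more than one JSON document.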

If your input file contains multiple JSON records on separate lines (called "JSON Lines" or "JSONL"), where each line is separately a valid JSON structure, jq can handle that without any extra options, but Python's json module needs you to specify your requirement in more detail, perhaps like this:

with gzip.open("my_file.json.gz", mode="rt") as f:
    data = [json.loads(line) for line in f]
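
To check whether this is what is happening, here is a small self-contained sketch. It writes a two-record sample file (made-up data, not the asker's file), shows that parsing the whole file as one document fails with "Extra data", and then parses it line by line. The helper name `read_json_lines_gz` is hypothetical. Since Python 3.6, json.loads() also accepts bytes, so the "Extra data" error more likely points to multiple records than to the binary mode:

```python
import gzip
import json
import os
import tempfile


def read_json_lines_gz(path):
    """Parse a gzipped JSON Lines file into a list, skipping blank lines."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Build a tiny two-record sample file (made-up data for illustration).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, mode="wt", encoding="utf-8") as f:
    f.write('{"id": 1}\n{"id": 2}\n')

# Parsing the whole file as a single JSON document fails ...
whole_file_error = None
with gzip.open(sample_path, mode="rt", encoding="utf-8") as f:
    try:
        json.loads(f.read())
    except json.JSONDecodeError as exc:
        whole_file_error = exc.msg

# ... while parsing one record per line succeeds.
records = read_json_lines_gz(sample_path)
```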
tripleee
0

You can take a look at this post: https://stackoverflow.com/a/39451012/10642508 It seems to be the same issue. This code should work:

with gzip.open(jsonfilename, 'r') as fin:
    data = json.loads(fin.read().decode('utf-8'))
  • 1
    Please don't post links to other questions as answers. Once you earn enough reputation, you will be able to vote to close as duplicate. – tripleee Mar 18 '22 at 11:25
0

The read mode and the decoding are what need to be specified: open the file in binary mode, then decode the bytes.

Sample code:

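import gzip

# Open in binary mode; the context manager closes the file afterwards.
with gzip.open('a.json.gz', 'rb') as f:
    file_content = f.read()

# decode() defaults to UTF-8; pass another encoding if the file uses one.
print(file_content.decode())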
madmatrix