
I am reading in a large gzipped JSON file, ~4 GB. I want to read in the first n lines.

with gzip.open('/path/to/my/data/data.json.gz','rt') as f:
    line_n = f.readlines(1)
    print(ast.literal_eval(line_n[0])['events']) # a dictionary object

This works fine when I want to read a single line. If I now try to read in a loop, e.g.

no_of_lines = 1
with gzip.open('/path/to/my/data/data.json.gz','rt') as f:
    for line in range(no_of_lines):
        line_n = f.readlines(line)
        print(ast.literal_eval(line_n[0])['events'])

My code takes forever to execute, even if that loop has length 1. I'm assuming this behaviour has something to do with how gzip reads files; perhaps when I loop it tries to obtain information about the file length, which causes the long execution time? Can anyone shed some light on this and potentially provide an alternative way of doing it?

An edited first line of my data: ['{"events": {"category": "EVENT", "mac_address": "123456", "co_site": "HSTH"}}\n']

  • `readlines` loads the whole file into memory, perhaps you should use `readline` without the 's' – jvx8ss Jan 17 '23 at 15:48
  • @Pingu They are already using the gzip module. – gre_gor Jan 17 '23 at 15:50
  • Where the code says `line_n = f.readlines(line)`, exactly what do you expect this to mean? What do you think will be the value of `line`, and how much data do you expect will be read? Did you try to test this theory, for example, by checking the value of `line_n` afterward? (Did you try to check the value of `line` before the `.readlines` call? Did you read the documentation for `.readlines`?) – Karl Knechtel Jan 17 '23 at 15:52
  • Does https://stackoverflow.com/questions/11555468 help? – Karl Knechtel Jan 17 '23 at 15:53
  • Why are you trying to parse JSON with `ast.literal_eval()` and not a JSON library? Assuming this even has valid JSON strings per line. – gre_gor Jan 17 '23 at 15:54
  • the input json file is not a json file or it should not be possible to read only one line and it to be valid. Maybe paste a small example of your input? – Jean-François Fabre Jan 17 '23 at 15:55
  • "I am reading in a large zipped json file ~4GB." If it is **actually** JSON, then it **cannot** be parsed line-by-line. It is possible that you have JSONL, a related format where this will work (every line has a separate JSON document on it). – Karl Knechtel Jan 17 '23 at 15:55
  • and it's not possible to parse a json file reliably with ast.literal_eval as if the file contains booleans or null ast will choke on it. Use `json` – Jean-François Fabre Jan 17 '23 at 15:56
  • @gre_gor my first line looked like `['{"events": {"category": "EVENT", "mac_address": "123456", "ver": " ", ... }}\n']` so i thought it appropriate to use `ast`. I wasn't aware of any json libs that can do this. – user17033672 Jan 17 '23 at 16:02
  • `'{"events": {"category": "EVENT", "mac_address": "123456", "ver": " "}}\n'` is a valid JSON string parsable with the builtin JSON library. – gre_gor Jan 17 '23 at 16:10
  • @gre_gor does the list wrapper `[]` matter at all? – user17033672 Jan 17 '23 at 16:17
  • That's not part of the read string. That's Python's list. – gre_gor Jan 17 '23 at 16:19
  • @KarlKnechtel I don't think so? In that example they are reading all line, in this example I want to read the first n lines in. I see no way to incorporate that accepted answer into my problem. – user17033672 Jan 17 '23 at 16:22
  • Does this answer your question? [How to read first N lines of a file?](https://stackoverflow.com/questions/1767513/how-to-read-first-n-lines-of-a-file) – gre_gor Jan 17 '23 at 16:29
  • @gre_gor using a tweaked version of the suggested answer I get a `UnicodeDecodeError`, so unfortunately no – user17033672 Jan 17 '23 at 16:35
  • "In that example they are reading all line" - yes, **one at a time**, rather than up front. What the rest of us have been trying to tell you is that the current code *appears to be* reading everything into memory up front and then *processing* a specific number of lines. We have also been hinting at ways to *check* whether that is the case. – Karl Knechtel Jan 18 '23 at 04:29
  • "Can someone explain to me why my question was downvoted?" Please read [ask] and https://ericlippert.com/2014/03/05/how-to-debug-small-programs/, and try to paint a clearer picture of *what is actually happening* when the code runs - by *consciously investigating that*. See also [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/questions/261592/). "@gre_gor using a tweaked version of the suggested answer I get a UnicodeDecodeError, so unfortunately no" That is a **separate issue**. – Karl Knechtel Jan 18 '23 at 04:31

1 Answer


You are using the `readlines()` method, which reads all remaining lines of the file at once, so Python has to load everything into memory. Worse, in your loop `range(no_of_lines)` yields `0` on the first iteration, and `f.readlines(0)` treats a size hint of zero as "no limit": that single call decompresses and loads the entire ~4 GB file. (Your first snippet is fast because with `readlines(1)` the first line already exceeds the 1-byte hint, so reading stops after one line.)

An alternative is to iterate over the file object itself, which yields one line at a time without loading the whole file into memory:

import ast
import gzip

with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in f:
        print(ast.literal_eval(line)['events'])
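If each line really is a standalone JSON document (i.e. the file is JSON Lines), here is a sketch of reading just the first n lines, using `itertools.islice` to cap the iteration and `json.loads` instead of `ast.literal_eval`, as the comments suggest (`read_first_n` is a name chosen here for illustration):

```python
import gzip
import itertools
import json

def read_first_n(path, n):
    """Parse the first n lines of a gzipped JSON-lines file."""
    records = []
    with gzip.open(path, 'rt') as f:
        # islice stops the iteration after n lines, so the rest of
        # the file is never decompressed or read into memory
        for line in itertools.islice(f, n):
            records.append(json.loads(line))
    return records
```

Because gzip decompresses as a stream, this only does as much work as the first n lines require, no matter how large the file is.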