
I have a problem when trying to ingest data from an S3 HTTPS link into Python using the requests library. My code is as follows:

import gzip
import json
import requests

def parse(url: str):
    # stream=True defers the body download, but accessing r.content
    # still reads the entire response into memory before decompressing
    r = requests.get(url, stream=True)
    data = gzip.decompress(r.content)

    raw_data = []
    # the payload is JSON lines: one JSON document per line
    for line in data.splitlines():
        raw_data.append(json.loads(line.decode("utf-8")))
    return raw_data

raw_data = parse('https://s3-eu-west-1.amazonaws.com/path/of/bucket.json.gz')

When I run this, the code runs without raising any error, but it never finishes; it looks stuck. The data is 3.1 GB, so I was not expecting it to take this long (I actually waited more than an hour).

What could the problem be? Do you have any suggestions?

  • I recommend that you add some debugging lines to see what is happening. For example, inside the `for` loop, print the contents of `line` and `raw_data`. – John Rotenstein Nov 26 '22 at 21:44
  • Do you have enough available memory to store both the gzipped and the unzipped contents? (Did you try running a resource monitor in parallel? If memory use is going up, then it's not dead, it's working...) – Adam Smooch Nov 26 '22 at 21:46
  • @JohnRotenstein I added debugging lines; the line starting with `r = requests` works fine, but the waiting problem starts at `r.content` – bbgghh Nov 26 '22 at 22:04
  • I'd suggest downloading the file, then extracting it. E.g. https://stackoverflow.com/questions/37573483/progress-bar-while-download-file-over-http-with-requests By doing it this way, you'll be able to 1) get feedback about how fast the download is, and 2) check whether this is a requests problem or a gzip problem. (A sketch of this approach appears after these comments.) – Nick ODell Nov 26 '22 at 22:20
  • @NickODell I tried your method and it downloaded in 20 minutes – bbgghh Nov 26 '22 at 22:48
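
For reference, here is a minimal sketch of the download-then-extract approach Nick ODell describes, reusing the placeholder URL from the question; the local filename bucket.json.gz and the one-megabyte chunk size are illustrative choices, not part of the original post:

import gzip
import json
import requests

def download(url: str, dest: str, chunk_size: int = 1024 * 1024) -> None:
    # Stream the response body to disk in chunks, printing progress so a
    # slow download is visible instead of looking like a hang.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get('Content-Length', 0))
        done = 0
        with open(dest, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                done += len(chunk)
                if total:
                    print(f'\rdownloaded {done / total:.1%}', end='', flush=True)
    print()

def parse(path: str) -> list:
    # gzip.open in text mode yields decompressed lines one at a time,
    # so the full decompressed payload never has to sit in memory.
    raw_data = []
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            raw_data.append(json.loads(line))
    return raw_data

download('https://s3-eu-west-1.amazonaws.com/path/of/bucket.json.gz', 'bucket.json.gz')
raw_data = parse('bucket.json.gz')

Streaming the body to disk with iter_content makes the download speed visible, and gzip.open then decompresses line by line, so neither the compressed file nor the full decompressed byte string has to be held in memory at once (the parsed records still accumulate in raw_data, as in the question).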

0 Answers