
I have a problem when trying to ingest data from an S3 HTTPS link into Python using the requests library. My code is as follows:

import gzip
import json
import requests

def parse(url: str):
    # stream=True defers the body download, but accessing r.content
    # still reads the entire response into memory before decompressing
    r = requests.get(url, stream=True)
    data = gzip.decompress(r.content)

    raw_data = []
    # the payload is JSON lines: one JSON document per line
    for line in data.splitlines():
        raw_data.append(json.loads(line.decode("utf-8")))
    return raw_data

raw_data = parse('https://s3-eu-west-1.amazonaws.com/path/of/bucket.json.gz')

When I run this, the code runs without raising any error, but it never finishes; it looks stuck. The data is 3.1 GB, so I was not expecting it to take this long (I actually waited more than an hour).

What could the problem be? Do you have any suggestions?

  • I recommend that you add some debugging lines to see what is happening. For example, inside the `for` loop, print the contents of `line` and `raw_data`. – John Rotenstein Nov 26 '22 at 21:44
  • Do you have enough available memory to store both the gzipped and the unzipped contents? (Did you try running a resource monitor in parallel? If memory use is going up, then it's not dead, it's working...) – Adam Smooch Nov 26 '22 at 21:46
  • @JohnRotenstein I added debugging lines; the line starting with `r = requests` works fine, but the waiting problem starts at `r.content` – bbgghh Nov 26 '22 at 22:04
  • I'd suggest downloading the file, then extracting it. E.g. https://stackoverflow.com/questions/37573483/progress-bar-while-download-file-over-http-with-requests By doing it this way, you'll be able to 1) get feedback about how fast the download is, and 2) check whether this is a requests problem or a gzip problem. (A sketch of this approach appears after these comments.) – Nick ODell Nov 26 '22 at 22:20
  • @NickODell I tried your method and it downloaded in 20 minutes – bbgghh Nov 26 '22 at 22:48
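
For reference, here is a minimal sketch of the download-then-extract approach Nick ODell describes, reusing the placeholder URL from the question; the local filename bucket.json.gz and the one-megabyte chunk size are illustrative choices, not part of the original post:

import gzip
import json
import requests

def download(url: str, dest: str, chunk_size: int = 1024 * 1024) -> None:
    # Stream the response body to disk in chunks, printing progress so a
    # slow download is visible instead of looking like a hang.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get('Content-Length', 0))
        done = 0
        with open(dest, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                done += len(chunk)
                if total:
                    print(f'\rdownloaded {done / total:.1%}', end='', flush=True)
    print()

def parse(path: str) -> list:
    # gzip.open in text mode yields decompressed lines one at a time,
    # so the full decompressed payload never has to sit in memory.
    raw_data = []
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            raw_data.append(json.loads(line))
    return raw_data

download('https://s3-eu-west-1.amazonaws.com/path/of/bucket.json.gz', 'bucket.json.gz')
raw_data = parse('bucket.json.gz')

Streaming the body to disk with iter_content makes the download speed visible, and gzip.open then decompresses line by line, so neither the compressed file nor the full decompressed byte string has to be held in memory at once (the parsed records still accumulate in raw_data, as in the question).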

0 Answers