
I am reading a huge amount of data with requests and saving it to a file. This works for less than 1 GB of data, but for 1 GB to 5 GB it takes a very long time, the file never finishes saving, and I get connection errors from the source I am fetching from.

Piece of Code I tried:

with requests.get(url, ...) as r:
    with open(file, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10000):
            if chunk:
                f.write(chunk)
                f.flush()

Any suggestions to speed up downloading and saving to a file would be helpful. I tried different chunk sizes and commenting out flush(), but saw no real improvement.


2 Answers


I think the best approach is to do a parallel download.

step 1: pip install pypdl

step 2: to download the file, you could use:

from pypdl import Downloader

dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt')

by: Jishnu

There are different options in the source Stack Overflow question:

source: Downloading a large file in parts using multiple parallel threads
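
For reference, a minimal sketch of the same idea using only requests, HTTP Range headers, and a thread pool might look like the following. It assumes the server reports Content-Length and supports Range requests; the URL, output path, part count, chunk size, and timeout are placeholders rather than anything from the question.

import requests
from concurrent.futures import ThreadPoolExecutor

url = "http://example.com/file.bin"   # placeholder URL
out_path = "file.bin"                 # placeholder output path
num_parts = 4

# Ask the server for the total size; assumes it reports Content-Length.
size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
part_size = size // num_parts

# Pre-allocate the output file so each worker can write at its own offset.
with open(out_path, "wb") as f:
    f.truncate(size)

def fetch_range(index):
    # Compute this worker's byte range; the last part takes the remainder.
    start = index * part_size
    end = size - 1 if index == num_parts - 1 else start + part_size - 1
    headers = {"Range": f"bytes={start}-{end}"}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "r+b") as f:
            f.seek(start)
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

with ThreadPoolExecutor(max_workers=num_parts) as pool:
    # list() forces any exception raised in a worker to surface here.
    list(pool.map(fetch_range, range(num_parts)))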

  • Thanks Jishnu. This is something I have not tried; I will try it and see how it goes for larger data, around 5 GB. – G Naik May 28 '23 at 16:45
  • Jishnu, how can we pass parameters like auth, proxies, headers, and verify=False to the downloader object? – G Naik May 29 '23 at 07:53
  • @GNaik take a look at the package code; I think you can easily adapt it for your use case: https://github.com/m-jishnu/pypdl/blob/main/pypdl/main.py – Griner May 30 '23 at 15:03

As noted in the comments, you need to pass stream=True to requests.get(), or requests will load the entire response into memory before iter_content() gives you anything. You may be doing that already; it's not clear from your question.


if chunk:

This step isn't required - iter_content() won't give you empty chunks.


f.flush()

This is slowing your code down. Calling flush() after every chunk forces Python to hand each small write to the operating system immediately, instead of letting writes accumulate in the buffer. It's much faster to let the file object queue up as many writes as possible.

It's also not required. When the with block exits, the file is closed, which implicitly flushes any remaining buffered writes to the file.

For those reasons, you should delete this line of code.
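
Putting these points together, a minimal sketch of the cleaned-up loop might look like this. The URL, output path, timeout, and the larger chunk size are placeholders and assumptions rather than values from your question; a bigger chunk_size simply reduces the number of Python-level iterations.

import requests

url = "http://example.com/bigfile.bin"   # placeholder URL
out_path = "bigfile.bin"                 # placeholder output path

# stream=True keeps requests from reading the whole response into memory.
with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open(out_path, "wb") as f:
        # 1 MiB chunks mean fewer iterations; no manual flush() is needed,
        # because closing the file flushes the remaining buffer.
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)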

  • Thanks Nick for the suggestions. I have already tried these; they work well for around 1 GB of data, but for huge files around 5 GB it again takes a very long time. – G Naik May 28 '23 at 16:44
  • @GNaik How long does it take? How does that time compare to using a different tool to download the 5 GB file? Your internet connection may just be slow. – Nick ODell May 28 '23 at 16:53
  • Hi Nick, for 1 GB of data it takes around one hour, but for 5 GB and more it runs for more than 6 hours without completing the download. – G Naik May 28 '23 at 17:43
  • @GNaik Many ISPs slow down your connection when you download more data. What you've described seems like it could be caused by that. That's why I asked if you'd tried this using another download tool, like wget or a browser. – Nick ODell May 28 '23 at 17:48