
I am reading a huge amount of data with requests and saving it to a file. This works for less than 1 GB of data, but for 1 GB to 5 GB it takes a very long time, the file never finishes saving, and I get connection errors from the source I am fetching from.

Piece of Code I tried:

with requests.get(url, ...) as r:
    with open(file, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10000):
            if chunk:
                f.write(chunk)
                f.flush()

Any suggestions to speed up downloading and saving to a file would be helpful. I tried different chunk sizes and commenting out flush(), but saw no real improvement.


2 Answers


I think the best approach is to do a parallel download.

step 1: pip install pypdl

step 2: to download the file, you could use:

from pypdl import Downloader

dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt')

by: Jishnu

There are different options in the source Stack Overflow question:

source: Downloading a large file in parts using multiple parallel threads
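
For reference, a minimal sketch of the same idea using only requests, HTTP Range headers, and a thread pool might look like the following. It assumes the server reports Content-Length and supports Range requests; the URL, output path, part count, chunk size, and timeout are placeholders rather than anything from the question.

import requests
from concurrent.futures import ThreadPoolExecutor

url = "http://example.com/file.bin"   # placeholder URL
out_path = "file.bin"                 # placeholder output path
num_parts = 4

# Ask the server for the total size; assumes it reports Content-Length.
size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
part_size = size // num_parts

# Pre-allocate the output file so each worker can write at its own offset.
with open(out_path, "wb") as f:
    f.truncate(size)

def fetch_range(index):
    # Compute this worker's byte range; the last part takes the remainder.
    start = index * part_size
    end = size - 1 if index == num_parts - 1 else start + part_size - 1
    headers = {"Range": f"bytes={start}-{end}"}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "r+b") as f:
            f.seek(start)
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

with ThreadPoolExecutor(max_workers=num_parts) as pool:
    # list() forces any exception raised in a worker to surface here.
    list(pool.map(fetch_range, range(num_parts)))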

  • Thanks Jishnu. This is something I have not tried; I will try it and see how it goes for larger data, around 5 GB. – G Naik May 28 '23 at 16:45
  • Jishnu, how can we pass parameters like auth, proxies, headers, and verify=False to the downloader object? – G Naik May 29 '23 at 07:53
  • @GNaik take a look at the package code; I think you can easily adapt it for your use case: https://github.com/m-jishnu/pypdl/blob/main/pypdl/main.py – Griner May 30 '23 at 15:03

As noted in the comments, you need to pass stream=True to requests.get(), or requests will load the entire response into memory before iter_content() gives you anything. You may be doing that already; it's not clear from your question.


if chunk:

This step isn't required - iter_content() won't give you empty chunks.


f.flush()

This is slowing your code down. Calling flush() after every chunk forces Python to hand each small write to the operating system immediately, instead of letting writes accumulate in the buffer. It's much faster to let the file object queue up as many writes as possible.

It's also not required. When the with block exits, the file is closed, which implicitly flushes any remaining buffered writes to the file.

For those reasons, you should delete this line of code.
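
Putting these points together, a minimal sketch of the cleaned-up loop might look like this. The URL, output path, timeout, and the larger chunk size are placeholders and assumptions rather than values from your question; a bigger chunk_size simply reduces the number of Python-level iterations.

import requests

url = "http://example.com/bigfile.bin"   # placeholder URL
out_path = "bigfile.bin"                 # placeholder output path

# stream=True keeps requests from reading the whole response into memory.
with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open(out_path, "wb") as f:
        # 1 MiB chunks mean fewer iterations; no manual flush() is needed,
        # because closing the file flushes the remaining buffer.
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)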

  • Thanks Nick for the suggestions. I have already tried these; they work well for around 1 GB of data, but for huge files around 5 GB it again takes a very long time. – G Naik May 28 '23 at 16:44
  • @GNaik How long does it take? How does that time compare to using a different tool to download the 5 GB file? Your internet connection may just be slow. – Nick ODell May 28 '23 at 16:53
  • Hi Nick, for 1 GB of data it takes around one hour, but for 5 GB and more it runs for more than 6 hours without completing the download. – G Naik May 28 '23 at 17:43
  • @GNaik Many ISPs slow down your connection when you download more data. What you've described seems like it could be caused by that. That's why I asked if you'd tried this using another download tool, like wget or a browser. – Nick ODell May 28 '23 at 17:48