
I have a link to a file, for example https://example.com/video.mp4, and I'm trying to download video.mp4. The code I currently have (from here):

import requests
# URL, req_cookies and local_filename are defined elsewhere in my script
with requests.get(URL, cookies=req_cookies, stream=True) as r:
    r.raise_for_status()
    with open(local_filename, 'wb') as f:
        # stream the body to disk in 8 KB chunks
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

After debugging, the program gets stuck on requests.get() while it downloads the whole file into RAM; only then does it copy the data from RAM to a file on the SSD.

I want to write the requests.get() response directly to a file, instead of downloading it to RAM and then copying it to a file, for the following reasons:

  1. I download large files (larger than 1 GB)
  2. I don't want Python to use so much memory
  3. I'm using threads to download multiple files in parallel, and the RAM may fill up

I don't mind using a different library that can do this; all I need is support for loading cookies.
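For reference, a variant that still uses requests but keeps the body off the heap by handing the raw urllib3 stream to shutil.copyfileobj. This is only a minimal sketch; the URL, cookie dict and filename below are placeholders standing in for my real values:

import shutil
import requests
URL = "https://example.com/video.mp4"        # placeholder, as in the example above
req_cookies = {"sessionid": "..."}           # hypothetical cookie dict
local_filename = "video.mp4"
with requests.get(URL, cookies=req_cookies, stream=True) as r:
    r.raise_for_status()
    r.raw.decode_content = True              # let urllib3 undo gzip/deflate transparently
    with open(local_filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f, length=1024 * 1024)   # copy the socket stream to disk in 1 MiB chunks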

  • This *doesn't* download the file in memory. It only retrieves 8 KB at a time and writes it to disk. Have you actually tried this code? Did you check memory usage? – Panagiotis Kanavos Oct 05 '22 at 14:50
  • Does this answer your question? [Download large file in python with requests](https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests) – Panagiotis Kanavos Oct 05 '22 at 14:51
  • Yes, the link is mentioned in the question. – one_hell_of_a_guy Oct 05 '22 at 14:53
  • All the solutions in the duplicate retrieve the file in chunks and write them to disk. Even those that use `asyncio` and even those that use libraries. – Panagiotis Kanavos Oct 05 '22 at 14:53
  • As written in the question, requests.get() downloads the entire file; that's why the program is stuck on it. After that it takes a few seconds and the program finishes, which means it downloads everything at once and then writes it to the file in 8192-byte chunks. – one_hell_of_a_guy Oct 05 '22 at 14:54
  • How did you determine that? Did you let the code run and monitor the process's RAM usage (see the memory-check sketch after these comments)? Or did you debug it, possibly forcing *the operating system* to buffer the data already received from the server? – Panagiotis Kanavos Oct 05 '22 at 14:54
  • I did both; after monitoring, the Python process used ~1GB of RAM. – one_hell_of_a_guy Oct 05 '22 at 14:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248569/discussion-between-one-hell-of-a-guy-and-panagiotis-kanavos). – one_hell_of_a_guy Oct 05 '22 at 14:58
  • What? How? Extraordinary claims require extraordinary proof and you offer none. Describe the actual process you used to determine something is wrong. Preferably with an actual URL others can also use. – Panagiotis Kanavos Oct 05 '22 at 14:59
  • Besides, that question has a *lot* of answers. Have you tried one of the others? – Panagiotis Kanavos Oct 05 '22 at 14:59
  • Yes. I saw that each one of the answers saves the requests.get() response to memory and then does something with it, which is exactly what I'm trying to avoid. – one_hell_of_a_guy Oct 05 '22 at 15:02
  • So either everyone else is wrong, or you are. None of them does what you claim. On the contrary [they ensure they read data from the response stream with stream=True](https://requests.readthedocs.io/en/latest/user/quickstart/#raw-response-content) instead of reading the entire response in memory, then either copy or write chunks from the network stream to the file stream – Panagiotis Kanavos Oct 05 '22 at 15:07
  • If you are right, why is my program stuck for most of the time on requests.get(), and why, once that statement is done, is the file created and the data written to it in seconds? I think that's because the data is already in memory, and I just read it from RAM in chunks. – one_hell_of_a_guy Oct 05 '22 at 15:14
  • Anyway, the code you posted works just fine. I used it to download a 250MB file and Python's memory usage in Task Manager increased by 400KB while downloading, going back to its original size once the download finishes. I don't know what you're trying to download, whether it takes a lot of time for the server to respond or how you measure memory usage. Until now you haven't provided any evidence there's any problem – Panagiotis Kanavos Oct 05 '22 at 15:14
  • `why is my program stuck for most of the time on requests.get()`: because your server is slow. Or because you use Fiddler, or a similar debugging proxy, which caches the entire content before responding. We can't guess what you're doing on your machine. The code you posted works fine – Panagiotis Kanavos Oct 05 '22 at 15:17
  • I just tried downloading the 5GB file from https://testfiledownload.com/. Python's memory usage under VS Code remained 17.7MB while downloading the file at 50MB/s – Panagiotis Kanavos Oct 05 '22 at 15:19
  • But why isn't the file created until requests.get() finishes? Where is it saved? – one_hell_of_a_guy Oct 05 '22 at 15:23
  • Who said it's not? It's created when `open(local_filename, 'wb')` is called and filled as the download proceeds. Did you actually try to check the file's size with `dir` or `ls`? File explorers don't update their views instantaneously. Besides, if a proxy intercepts the response, you'll get the entire response at once when the proxy returns it to the application. This isn't `get`'s problem – Panagiotis Kanavos Oct 05 '22 at 15:26
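Regarding the memory-monitoring question in the comments above, one way to check whether the response body is actually being buffered in RAM is to print the process's resident set size inside the download loop. A minimal sketch, assuming the third-party psutil package is installed (pip install psutil) and using the test URL from the answer below:

import psutil
import requests
URL = "http://speedtest-sgp1.digitalocean.com/5gb.test"   # test URL used in the answer below
proc = psutil.Process()                                    # the current Python process
with requests.get(URL, stream=True) as r:
    r.raise_for_status()
    with open("test.bin", "wb") as f:
        for i, chunk in enumerate(r.iter_content(chunk_size=8192)):
            f.write(chunk)
            if i % 10000 == 0:                             # report roughly every 80 MB
                print(f"chunk {i}: RSS = {proc.memory_info().rss / 2**20:.1f} MiB")

If the printed RSS stays roughly flat while the file on disk keeps growing, the response is not being held in memory.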

1 Answer


The code works fine. I just tried downloading the 5GB file from testfiledownload.com. Python's memory usage under VS Code remained 17.7MB while downloading the file at 50MB/s.

>>> import requests
>>> import time
>>> URL="http://speedtest-sgp1.digitalocean.com/5gb.test"
>>> local_filename=r"c:\projects\test.zip"
>>> time.perf_counter()
100881.9425795
>>> with requests.get(URL, stream=True) as r:
...     time.perf_counter()
...     r.raise_for_status()
...     with open(local_filename, 'wb') as f:
...         for chunk in r.iter_content(chunk_size=8192):
...             t=time.perf_counter()
...             bytes=f.write(chunk)
...             print(f"{t} {bytes}")
... 
100882.3512556
100882.3525191 8192
100882.5384427 8192
100882.5391189 8192
100882.5398293 8192
100882.5403763 8192
100882.7277222 8192
100882.728885 8192
100882.7297928 8192
100882.7306436 8192

File Explorer was slow to update the file's size but the file's properties did show that the file increased in size. So did dir and ls. File explorers in all OSs delay updates because they have to read the file metadata from the file system. That's an expensive operation so they'll try to reduce updates or wait until activity dies down before updating their views.
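If the file manager's view looks stale, the on-disk size can also be polled directly from a second Python shell while the download runs; a small standard-library sketch, using the same path as the session above:

import os
import time
local_filename = r"c:\projects\test.zip"   # same path as in the session above
while True:                                # stop with Ctrl+C
    print(f"{os.path.getsize(local_filename) / 2**20:.1f} MiB on disk")
    time.sleep(1)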

If get takes a lot of time to complete, it may be that the server is slow to respond.
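A simple way to see where the time actually goes: with stream=True, get() returns as soon as the response headers arrive, and the body is only read from the socket while iterating. A rough sketch that times the two phases separately, reusing the test URL from the session above:

import time
import requests
URL = "http://speedtest-sgp1.digitalocean.com/5gb.test"
t0 = time.perf_counter()
with requests.get(URL, stream=True) as r:
    t_headers = time.perf_counter() - t0                   # time until the headers arrived
    r.raise_for_status()
    with open("test.bin", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
t_total = time.perf_counter() - t0
print(f"headers after {t_headers:.2f}s, full body after {t_total:.2f}s")

If the headers arrive quickly and the iteration takes the rest of the time, get() itself is not buffering anything.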

It's also possible that a debugging proxy like Fiddler is intercepting the response, downloading everything and then presenting the entire payload as a single blob to the application.

I've run into this in the past, when I thought my partial download code was working fine, only to realize Fiddler was intercepting the entire response and sending me the entire file at the end.
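One way to rule that out is to tell requests to ignore environment and system proxy settings for the download; a minimal sketch using Session.trust_env:

import requests
URL = "http://speedtest-sgp1.digitalocean.com/5gb.test"
session = requests.Session()
session.trust_env = False                  # ignore HTTP(S)_PROXY and system proxy settings
with session.get(URL, stream=True) as r:
    r.raise_for_status()
    with open("test.bin", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

If the behaviour changes drastically with the proxy bypassed, an intercepting proxy was buffering the whole response before handing it over.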

Panagiotis Kanavos