
I use Python 2.7.8 and the requests library to download a file from an HTTP server. Before downloading, I want to check the file size and do something different (e.g. abort) when the size exceeds a given limit. I know that this is easy if the server provides a Content-Length header - however, the server I am dealing with doesn't.
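
For reference, when a server does send that header, the check can be done before the body is read at all. A minimal sketch (using the test URL from my code below and a 200 MB limit):

import requests

# stream=True fetches only the headers; the body stays on the socket until read.
response = requests.get('http://leil.de/di/files/more/testdaten/25mb.test', stream=True)
size = response.headers.get('Content-Length')
if size is not None and int(size) > 1024 * 1024 * 200:
    response.close()  # too big - abort without ever reading the body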

According to this great article on exception handling with requests, checking the file size before saving to hard disk can be done by downloading only the headers and then iterating over the content without actually saving the file. This approach is used in my code below.

However, I got the impression that I can only iterate over the content once (to check the file size) before the connection gets closed. There is nothing like seek(0) or similar to reset to the beginning and iterate again, this time saving the file to disk. When I try this (as in my code below), I end up with a 0 kB file on my hard disk.

import requests
from contextlib import closing

# Create a custom exception.
class ResponseTooBigException(requests.RequestException):
    """The response is too big."""

# Maximum file size and download chunk size.
TOO_BIG = 1024 * 1024 * 200 # 200MB
CHUNK_SIZE = 1024 * 128

# Connect to a test server. stream=True ensures that only the headers are downloaded here; the body is fetched lazily.
response = requests.get('http://leil.de/di/files/more/testdaten/25mb.test', stream=True)

try:

    # Iterate over the response's content without actually saving it on harddisk.
    with closing(response) as r:
        content_length = 0
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            content_length = content_length + len(chunk)

            # Do not download the file if it is too big.
            if content_length > TOO_BIG:
                raise ResponseTooBigException(response=response)

            else:    
                # If the file is not too big, this code should download the response file to harddisk. However, the result is a 0kb file.
                print('File size ok. Downloading...')
                with open('downloadedFile.test', 'wb') as f:
                    for chunk in response.iter_content(chunk_size=CHUNK_SIZE): 
                        if chunk:
                            f.write(chunk)
                            f.flush()

except ResponseTooBigException as e:
    print('The HTTP response was too big (> 200MB).')

I already tried to make a copy of the response first with

import copy
response_copy = copy.copy(response)

and then use response_copy in line

with closing(response_copy) as r:

but keep response in the line

for chunk in response.iter_content(chunk_size=CHUNK_SIZE): 

so as to allow two independent iterations over the response. However, this results in

AttributeError                            Traceback (most recent call last)
<ipython-input-2-3f918ff844c3> in <module>()
     35                         if chunk:
     36                             f.write(chunk)
---> 37                             f.flush()
     38 
     39 except ResponseTooBigException as e:

C:\Python34\lib\contextlib.py in __exit__(self, *exc_info)
    150         return self.thing
    151     def __exit__(self, *exc_info):
--> 152         self.thing.close()
    153 
    154 class redirect_stdout:

C:\Python34\lib\site-packages\requests\models.py in close(self)
    837         *Note: Should not normally need to be called explicitly.*
    838         """
--> 839         return self.raw.release_conn()

AttributeError: 'NoneType' object has no attribute 'release_conn'
  • You can't use copying here; the data hasn't been received yet and the remote HTTP server has no concept of there being two response objects; it'll still just send you the data once. – Martijn Pieters Feb 22 '15 at 21:20
  • 1
    Since the whole point of streaming content is to prevent the whole response body to be held in memory at once, `requests` won't hold on to that data for you. You'll have to keep track of the downloaded data yourself; in a temporary file or in a string or list or other in-memory structure. – Martijn Pieters Feb 22 '15 at 21:23
  • Okay thanks, I see. Then I would have to write the chunks e.g. to a BytesIO object and abort if the object gets too large, otherwise copy it to a file afterwards. This is not really what streaming content is about, as you said, but it's still better than downloading a file which is too large in one go and then aborting. – Dirk Feb 23 '15 at 09:42
  • 1
    Take a look at [`tempfile.SpooledTemporaryFile()`](https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile); it'll use disk I/O if the object gets on the large side, in-memory I/O otherwise. Perhaps a better choice than BytesIO? It'll use BytesIO under the hood. – Martijn Pieters Feb 23 '15 at 09:45

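Building on those comments, here is a minimal sketch of what the buffered approach could look like, assuming the ResponseTooBigException, TOO_BIG and CHUNK_SIZE definitions from the code above. tempfile.SpooledTemporaryFile keeps the data in memory (a BytesIO under the hood) until more than max_size bytes are written, and only then rolls over to a real temporary file on disk:

import shutil
import tempfile
from contextlib import closing

import requests

response = requests.get('http://leil.de/di/files/more/testdaten/25mb.test', stream=True)

with closing(response) as r:
    # Buffer in memory up to 10 MB, then spill over to a temporary file on disk.
    with tempfile.SpooledTemporaryFile(max_size=10 * 1024 * 1024) as buf:
        downloaded = 0
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            downloaded += len(chunk)
            # Abort as soon as the running total exceeds the limit.
            if downloaded > TOO_BIG:
                raise ResponseTooBigException(response=r)
            buf.write(chunk)

        # The limit was never exceeded: rewind and copy the buffer to the target file.
        buf.seek(0)
        with open('downloadedFile.test', 'wb') as f:
            shutil.copyfileobj(buf, f)

The 10 MB threshold for max_size is an arbitrary choice for illustration; the important point is that the response body is consumed exactly once and the data is kept around locally until the size check has passed.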