
I am downloading PDFs using the Python requests library by doing:

import requests
from tempfile import NamedTemporaryFile

# NamedTemporaryFile is opened in binary mode ('w+b') by default
f = NamedTemporaryFile()

response = requests.get(pdf_url)
assert response.status_code == 200  # optionally `assert response.ok`
f.write(response.content)

Every so often response.content appears to be truncated: when I do f.tell(), I see that there are fewer bytes than expected. The PDF is also broken: it does not open in a PDF reader.

When I then immediately redo the same request with the same URL, the full file is downloaded, f.tell() shows the expected value, and the PDF opens in a PDF reader.

Is this a commonly known problem?

Note: I seem to have a memory leak - but this problem happens when I am using 700 MB and still have 1300 MB free.
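
For context, a minimal sketch of a completeness check, assuming the server sends a `Content-Length` header (`pdf_url` is the same variable as above):

import requests

response = requests.get(pdf_url)
response.raise_for_status()

# Compare the advertised length with what actually arrived
expected = response.headers.get("Content-Length")
if expected is not None and int(expected) != len(response.content):
    print("Truncated: expected %s bytes, got %d" % (expected, len(response.content)))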

  • 1
    Did you open `f` in *binary* mode? You are missing a mode parameter here altogether (no `'w'` even). – Martijn Pieters Jun 09 '14 at 07:32
  • sorry Martjin, I realised Im using temfile library. Updated. – Rich Tier Jun 09 '14 at 07:35
  • And how are you determining that you have less data than you are getting fewer bytes than expected? – Martijn Pieters Jun 09 '14 at 07:36
  • when I see the downloaded file cannot be opened in pdf reader I download the file again. I compare the filesize of both. See the new one is longer, and it opens in pdf reader. – Rich Tier Jun 09 '14 at 07:44
  • 3
    Compare the `response.headers['Content-length']` result with the file size. Most likely *it is the server* that is sending you incomplete data. In any case, for larger (binary data) responses, it'll be more efficient to use streaming. See [How to download image using requests](http://stackoverflow.com/a/13137873) – Martijn Pieters Jun 09 '14 at 08:01
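
A minimal sketch of the streaming approach suggested in the last comment, assuming the same `pdf_url` as in the question; it writes the response in chunks and verifies the byte count against `Content-Length` when the server provides one:

import requests
from tempfile import NamedTemporaryFile

f = NamedTemporaryFile()  # binary mode ('w+b') by default

# stream=True avoids holding the whole body in memory at once
response = requests.get(pdf_url, stream=True)
response.raise_for_status()

written = 0
for chunk in response.iter_content(chunk_size=8192):
    f.write(chunk)
    written += len(chunk)

# If the server advertised a Content-Length, verify we received that many bytes
expected = response.headers.get("Content-Length")
if expected is not None and written != int(expected):
    raise IOError("incomplete download: got %d of %s bytes" % (written, expected))

Retrying the request once when this check fails would cover the intermittent truncation described in the question.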

0 Answers