
Situation: The file to be downloaded is a large file (>100MB). It takes quite some time, especially with slow internet connection.

Problem: However, I just need the file header (the first 512 bytes), which will decide if the whole file needs to be downloaded or not.

Question: Is there a way to download only the first 512 bytes of a file?

Additional information: Currently the download is done using `urllib.urlretrieve` in Python 2.7.

Timothy Wong

2 Answers


I think curl and head would work better than a Python solution here:

curl https://my.website.com/file.txt | head -c 512 > header.txt

EDIT: If you absolutely must do it from a Python script, you can use `subprocess` to run the `curl`-piped-to-`head` command.
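If you go that route, a minimal sketch might look like this (assumes `curl` and `head` are available on the system; `fetch_header` is a made-up helper name):

```python
import subprocess

def fetch_header(url, num_bytes=512):
    # Same pipeline as above: -s silences curl's progress output,
    # and head -c stops reading after num_bytes bytes.
    return subprocess.check_output(
        "curl -s %s | head -c %d" % (url, num_bytes), shell=True)
```

Note that building the command with string interpolation assumes the URL contains no shell metacharacters; quote it (e.g. with `pipes.quote` in Python 2 or `shlex.quote` in Python 3) if the URL comes from untrusted input.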

EDIT 2: For a fully Python solution: The `urlopen` function (`urllib2.urlopen` in Python 2, and `urllib.request.urlopen` in Python 3) returns a file-like stream that you can use the `read` function on, which allows you to specify a number of bytes. For example, `urllib2.urlopen(my_url).read(512)` will return the first 512 bytes of `my_url`.
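A minimal sketch of that approach in Python 3 (`read_header` is a made-up name; as noted in the comments below, `read(512)` limits what you receive, but may not stop the server from sending more of the body behind the scenes):

```python
from urllib.request import urlopen

def read_header(url, num_bytes=512):
    # urlopen returns a file-like response object; read(n) returns
    # at most n bytes of the body.
    with urlopen(url) as response:
        return response.read(num_bytes)
```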

Niema Moshiri
  • Ah yes. The edit was what I needed. But no Python modules can do this? – Timothy Wong Jan 15 '18 at 06:43
    The `urlopen` function (`urllib2.urlopen` in Python 2, and `urllib.request.urlopen` in Python 3) returns a file-like stream that you can use the `read` function on, which allows you to specify a number of bytes. For example, `urllib2.urlopen(my_url).read(512)` will return the first 512 bytes of `my_url`. However, I'm not certain this will *only* download 512 bytes, or if it will try to download the entire file behind-the-scenes and just return the first 512 – Niema Moshiri Jan 15 '18 at 06:47
  • the one in the comment works. do you want to replace it and let me accept as answer? – Timothy Wong Jan 15 '18 at 07:02
  • Might I add that `urllib` also offers the same function, if you want to reduce the number of libraries you import. (I had already imported `urllib` and was hesitant to also import `urllib2`.) – Timothy Wong Jan 15 '18 at 15:14

If the URL you are trying to read responds with a `Content-Length` header, then you can get the file size with `urllib2` in Python 2.

import urllib2

def get_file_size(url):
    # Issue a HEAD request so only the headers are transferred
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'
    response = urllib2.urlopen(request)
    length = response.headers.getheader("Content-Length")
    return int(length)

The function can be called to get the length and compared with some threshold value to decide whether to download or not.

if get_file_size("http://stackoverflow.com") < 1000000:
    # Download

(Note that the Python 3 implementation differs slightly:)

from urllib import request

def get_file_size(url):
    # In Python 3 the request method can be set directly
    r = request.Request(url, method='HEAD')
    response = request.urlopen(r)
    length = response.getheader("Content-Length")
    return int(length)
Simon Streicher
Ilayaraja
  • Love the idea, but I need to compare hash values, and the hash is stored in the file header. The file size can be the same while the contents differ, so the hash is a more reliable check than the file size. – Timothy Wong Jan 15 '18 at 06:56