
I'm using this piece of code to download mp3 podcasts.

import urllib2

req = urllib2.urlopen(item)   # 'item' holds the podcast URL
CHUNK = 16 * 1024             # read in 16 KB chunks
with open(local_file, 'wb') as fp:
    while True:
        chunk = req.read(CHUNK)
        if not chunk: break
        fp.write(chunk)

This works perfectly, but I am wondering: what is the optimal chunk size for the best download performance?

If it makes a difference, I'm on a 6 Mbit ADSL connection.

Stephen Angell
  • This is a good question, but not really urllib2/python specific. See http://stackoverflow.com/a/2811047/98057 for a pretty good answer. Are you sure this needs to be optimized? Try benchmarking. Compare to wget-ing the file. – André Laszlo Feb 24 '15 at 12:56

2 Answers


A good buffer size is the one your OS kernel already uses for the socket's receive buffer. That way, you don't perform more reads than you need to.

On GNU/Linux, the default socket buffer size can be read from /proc/sys/net/core/rmem_default (size in bytes). You can increase a socket's buffer size by using setsockopt to set the SO_RCVBUF option. However, this size is capped by the system (/proc/sys/net/core/rmem_max), and you need admin privileges (CAP_NET_ADMIN) to go beyond that limit.
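
For example, checking the default and requesting a bigger receive buffer could look roughly like this (a Linux-only sketch; the 256 KB request is an arbitrary example, and it uses a standalone socket rather than the one urllib2 opens internally):

import socket

# Read the kernel's default receive buffer size (Linux-specific path)
with open("/proc/sys/net/core/rmem_default") as f:
    rmem_default = int(f.read())
print "default receive buffer:", rmem_default, "bytes"

# Request a bigger receive buffer; the kernel caps the request at
# /proc/sys/net/core/rmem_max
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)
print "granted receive buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF), "bytes"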

At this point, though, you are doing something platform-specific for a small gain at best.

Still, it's a good idea to look at the available socket options (see man 7 socket) to perform micro-optimisations and learn stuff. :)

As there is no real sweet spot that always works best, you should always benchmark any tweak to check whether your changes are actually beneficial. Have fun!

giant_teapot
  • /proc/sys/net/core/rmem_default turns out to be 212992 - way outside of what I was considering as a buffer. I did end up doing some benchmarks across the whole range of that data (from 1 KB to 206 KB). The result: nothing conclusive - it doesn't really matter what you set it to, the difference is negligible and there is no particular pattern. Ho hum. Worth a try anyway. – Stephen Angell Feb 25 '15 at 23:55

To expand further on my comment to @giant_teapot, the code I used to benchmark was:

#!/usr/bin/env python

import time
import os
import urllib2

#5mb mp3 file
testdl = "http://traffic.libsyn.com/timferriss/Arnold_5_min_-_final.mp3" 

chunkmulti = 1   # chunk size multiplier: CHUNK = chunkmulti * 1024 bytes
numpass = 5      # number of download passes per chunk size

while (chunkmulti < 207):
    passtime = 0
    passattempt = 1
    while (passattempt <= numpass):
        start = time.time()
        req = urllib2.urlopen(testdl)
        CHUNK = chunkmulti * 1024
        with open("test.mp3", 'wb') as fp:
            while True:
                chunk = req.read(CHUNK)
                if not chunk: break
                fp.write(chunk)
        end = time.time()
        passtime += end - start
        os.remove("test.mp3")
        passattempt += 1
    print "Chunk size multiplier ", chunkmulti , " took ", passtime / passattempt, " seconds"
    chunkmulti += 1

The results weren't conclusive. Here's the first batch of results...

Chunk size multiplier  1  took  13.9629709721  seconds
Chunk size multiplier  2  took  8.01173728704  seconds
Chunk size multiplier  3  took  10.3750542402  seconds
Chunk size multiplier  4  took  7.11076325178  seconds
Chunk size multiplier  5  took  11.3685477376  seconds
Chunk size multiplier  6  took  6.86864703894  seconds
Chunk size multiplier  7  took  14.2680369616  seconds
Chunk size multiplier  8  took  7.93746650219  seconds
Chunk size multiplier  9  took  6.81188523769  seconds
Chunk size multiplier  10  took  7.54047352076  seconds
Chunk size multiplier  11  took  6.84347498417  seconds
Chunk size multiplier  12  took  7.88792568445  seconds
Chunk size multiplier  13  took  7.37244099379  seconds
Chunk size multiplier  14  took  8.15134423971  seconds
Chunk size multiplier  15  took  7.1664044857  seconds
Chunk size multiplier  16  took  10.9474172592  seconds
Chunk size multiplier  17  took  7.23868894577  seconds
Chunk size multiplier  18  took  7.66610199213  seconds

The results continued like this up to a chunk size of 206 KB.

So I set the chunk size to 6 KB. I might have a go at benchmarking this against wget next...
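
For reference, the wget comparison could be timed along the same lines (just a sketch; it assumes wget is installed and uses the same test file as the script above):

import subprocess
import time

# same 5mb mp3 test file as above
testdl = "http://traffic.libsyn.com/timferriss/Arnold_5_min_-_final.mp3"

start = time.time()
subprocess.call(["wget", "-q", "-O", "test.mp3", testdl])  # -q: quiet, -O: output file
print "wget took", time.time() - start, "seconds"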

Stephen Angell
  • This experiment is interesting. :) – giant_teapot Mar 01 '15 at 12:10
  • Sure, the gain will always be minimal (if any), and I think you should run multiple downloads per buffer size in order to get more samples and thus more reliable results. To that end, computing the mean and standard deviation would be most pertinent. Have fun! o/ – giant_teapot Mar 01 '15 at 12:16
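
A minimal sketch of that mean / standard deviation calculation (the timing samples here are hypothetical, not real measurements):

import math

def mean_and_stdev(samples):
    # plain mean and population standard deviation of the per-pass timings
    m = sum(samples) / float(len(samples))
    variance = sum((x - m) ** 2 for x in samples) / float(len(samples))
    return m, math.sqrt(variance)

pass_times = [13.96, 8.01, 10.37, 7.11, 11.37]  # hypothetical timings in seconds
mean, stdev = mean_and_stdev(pass_times)
print "mean %.2f s, stdev %.2f s" % (mean, stdev)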