How does urllib.urlopen() work?

Question

Let's consider a big file (~100MB). Let's consider that the file is line-based (a text file, with relatively short line ~80 chars). If I use built-in open()/file() the file will be loaded in lazy manner. I.E. if a I do aFile.readline() only a chunk of a file will reside in memory. Does the urllib.urlopen() do something similar (with usage of a cache on disk)?

How big is the difference in performance between urllib.urlopen().readline() and file().readline()? Let's consider that file is located on localhost. Once I open it with urllib.urlopen() and then with file(). How big will be difference in performance/memory consumption when i loop over the file with readline()?

What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?

score 2 · Answer 1 · answered Aug 26 '11 at 05:48

open (or file) and urllib.urlopen look like they're more or less doing the same thing there. urllib.urlopen is (basically) creating a socket._socketobject and then invoking the makefile method (contents of that method included below)

def makefile(self, mode='r', bufsize=-1):
    """makefile([mode[, bufsize]]) -> file object

    Return a regular file object corresponding to the socket.  The mode
    and bufsize arguments are as for the built-in open() function."""
    return _fileobject(self._sock, mode, bufsize)

Ferdinand Beyer · Accepted Answer · 2011-09-06T06:42:20.723

Does the urllib.urlopen() do something similar (with usage of a cache on disk)?

The operating system does. When you use a networking API such as urllib, the operating system and the network card will do the low-level work of splitting data into small packets that are sent over the network, and to receive incoming packets. Those are stored in a cache, so that the application can abstract away the packet concept and pretend it would send and receive continuous streams of data.

How big is the difference in performance between urllib.urlopen().readline() and file().readline()?

It is hard to compare these two. For urllib, this depends on the speed of the network, as well as the speed of the server. Even for local servers, there is some abstraction overhead, so that, usually, it is slower to read from the networking API than from a file directly.

For actual performance comparisons, you will have to write a test script and do the measurement. However, why do you even bother? You cannot replace one with another since they serve different purposes.

What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?

Since the bottle neck is the networking speed, it might be a good idea to process the data as soon as you get it. This way, the operating system can cache more incoming data "in the background".

It makes no sense to cache lines in a list before processing them. Your program will just sit there waiting for enough data to arrive while it could be doing something useful already.

How does urllib.urlopen() work?

2 Answers2