
Let me start off by saying, I'm not new to programming but am very new to python.

I've written a program using urllib2 that requests a web page that I would then like to save to a file. The web page is about 300KB, which doesn't strike me as particularly large but seems to be enough to give me trouble, so I'm calling it 'large'. I'm using a simple call to copy directly from the object returned from urlopen into the file:

file.write(webpage.read())

but it just sits for minutes trying to write the file, and eventually I receive the following:

Traceback (most recent call last):
  File "program.py", line 51, in <module>
    main()
  File "program.py", line 43, in main
    f.write(webpage.read())
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
httplib.IncompleteRead: IncompleteRead(6384 bytes read, 1808 more expected)

I don't know why this gives the program so much grief.
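(For anyone hitting the same `IncompleteRead`: one workaround, if the server really is closing the chunked stream early, is to catch the exception and keep whatever bytes did arrive via its `partial` attribute. A hedged sketch, not a real fix; the `try/except` import is just a portability shim:)

```python
try:
    import httplib  # Python 2 name
except ImportError:
    import http.client as httplib  # same module under its Python 3 name

def read_tolerant(resp):
    """Read a response body, keeping the partial payload if the
    server ends the chunked stream early (a workaround sketch)."""
    try:
        return resp.read()
    except httplib.IncompleteRead as e:
        return e.partial  # bytes received before the stream broke
```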


EDIT |

here is how I'm retrieving the page

jar = cookielib.CookieJar()
cookie_processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie_processor)
urllib2.install_opener(opener)

requ_login = urllib2.Request(LOGIN_PAGE,
                             data=urllib.urlencode({'destination': "",
                                                    'username': USERNAME,
                                                    'password': PASSWORD}))

requ_page = urllib2.Request(WEBPAGE)
try:
    #login
    urllib2.urlopen(requ_login)

    #get desired page
    portfolio = urllib2.urlopen(requ_page)
except urllib2.URLError as e:
    print e.code, ": ", e.reason
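(As the comments under the answer later record, the real culprit turned out to be the request timing out on the large response. A hedged sketch of raising the timeout; the 60-second value is an assumption, tune it for your connection:)

```python
import socket

# Raise the default timeout for every socket urllib2 opens,
# so a slow 300KB response isn't cut off mid-transfer.
socket.setdefaulttimeout(60)

# Or pass a per-request timeout instead (available since Python 2.6):
# portfolio = urllib2.urlopen(requ_page, timeout=60)
```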
Justin Smith
  • Can you show how you are opening the page? – César Nov 22 '11 at 00:59
  • http://stackoverflow.com/questions/3670257/httplib-incomplete-read might be related. – Terry Li Nov 22 '11 at 01:03
  • A couple of things to isolate...If you just read into an array, I assume you have the same problem without doing any file writing. Also, what if you specify a max size to the read call, like read(500000)? – TJD Nov 22 '11 at 01:03
  • Maybe this can help you: http://bobrochel.blogspot.com/2010/11/bad-servers-chunked-encoding-and.html – César Nov 22 '11 at 01:07
  • I get the same result from reading into an array, and I also tried adding a max size to read, still no. I edited the original question to include the code I'm using to get the page. – Justin Smith Nov 22 '11 at 19:05
  • I used the shutil.copyfileobj as suggested below, and it seems to be working now. Any idea why this is the case? – Justin Smith Nov 22 '11 at 19:33
  • *I used the shutil.copyfileobj as suggested below, and it seems to be working now. Any idea why this is the case?* I'm curious as well :) – Piotr Dobrogost May 30 '13 at 11:04

1 Answer


I'd use the handy file-object copier function provided by the shutil module. It worked on my machine :)

>>> import urllib2
>>> import shutil
>>> remote_fo = urllib2.urlopen('http://docs.python.org/library/shutil.html')
>>> with open('bigfile', 'wb') as local_fo:
...     shutil.copyfileobj(remote_fo, local_fo)
... 
>>> 

UPDATE: You may want to pass a third argument to copyfileobj, which controls the size of the internal buffer used to transfer the bytes.

UPDATE2: There's nothing fancy about shutil.copyfileobj. It simply reads a chunk of bytes from the source file object and writes it to the destination file object, repeatedly, until there's nothing more to read. Here's its actual source code, grabbed from the Python standard library:

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
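(To see that chunked copy, and the buffer-size argument from the first update, in isolation, here's a small sketch using in-memory file objects in place of a real HTTP response; the ~300KB payload size just mirrors the page in the question:)

```python
import io
import shutil

# Simulate a ~300KB "download" with an in-memory source.
payload = b"x" * (300 * 1024)
src = io.BytesIO(payload)
dst = io.BytesIO()

# Copy in explicit 16KB chunks (which is also the default buffer size).
shutil.copyfileobj(src, dst, 16 * 1024)

assert dst.getvalue() == payload
```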
Pavel Repin
  • Thank you, I did this and it seems to be working now. Do you have an idea of what copyfileobj does differently that makes it work better? – Justin Smith Nov 22 '11 at 19:33
  • Hi @JustinSmith, see my answer again, I've updated it with more info. `copyfileobj` doesn't do anything especially profound. Just copies bytes to the destination file one chunk at a time. – Pavel Repin Nov 22 '11 at 22:39
  • This still doesn't explain what's wrong with the original code. – Piotr Dobrogost Sep 23 '12 at 19:37
  • The real problem actually turned out to be that the request was timing out because of the large file size. I increased the timeout and everything worked after that – Justin Smith Feb 26 '13 at 20:59
  • I used this approach for [copying a remote binary file](http://stackoverflow.com/a/28306886/1497596). (For example, an image file.) – DavidRR Feb 03 '15 at 19:20