
I have to read the first few thousand records from a webpage that consists of millions of lines of text data, and I need a copy of this data on my own machine. I don't mind whether that means writing to a text file or downloading the whole thing at once; I've been trying the latter.

However, the page is so long that I run out of memory every time I try to request the millions of lines.

import os
import urllib.request   # in Python 3, urlretrieve lives in urllib.request

os.chdir('/Users/myusername/onamac')   # directory to save the download in
url = "http://myurlhere.com/"
urllib.request.urlretrieve(url, 'myfilename')   # tries to download the entire resource to disk

Eventually I get:

Traceback (most recent call last):
File "<ipython-input-38-0ebf43ee369f>", line 6, in <module>
 urllib.request.urlretrieve(url, 'mytweets')
File "/anaconda/lib/python3.6/urllib/request.py", line 281, in urlretrieve
 tfp.write(block)
OSError: [Errno 28] No space left on device

The data isn't just separated by lines, which is a problem; it's basically a series of dictionaries that I'd eventually want to run json.loads on and read into a large table.

Another idea I've had is to somehow stop the urlretrieve request when the file reaches a certain size (I don't really care specifically how many records I get; maybe I'd cap it at 1 GB or so and see if that's enough records). But I'm not sure how I'd use tell() or anything else, since I don't see how to stop urllib.request.urlretrieve partway through.
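For reference, this is the kind of thing I'm imagining: a minimal sketch that swaps urlretrieve for urllib.request.urlopen so the response can be read in fixed-size chunks and abandoned once a byte cap is hit. The URL, filename, cap, and chunk size below are placeholders I haven't tested against the real data.

import urllib.request

url = "http://myurlhere.com/"      # placeholder URL
out_path = "myfilename"            # placeholder output file
max_bytes = 1024 ** 3              # arbitrary cap of roughly 1 GB
chunk_size = 1024 * 1024           # read 1 MB at a time

written = 0
with urllib.request.urlopen(url) as response, open(out_path, "wb") as out:
    while written < max_bytes:
        chunk = response.read(chunk_size)
        if not chunk:              # end of data or server closed the connection
            break
        out.write(chunk)
        written += len(chunk)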

Alex
  • Big files need to be downloaded by chunks. Have a look at: https://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py – RaphaMex Jun 09 '17 at 03:49
  • @R.Saban intriguing, thanks so much! Is it recommended to pick the chunk size equal to the amount of data that I want, or should I use a counter to keep track of how many chunks I've written? – Alex Jun 09 '17 at 03:58
  • There is no general rule about sizes: you just try and see which gives the best performance in your case ;) – RaphaMex Jun 09 '17 at 04:03
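Following the chunked-download suggestion from the comments above, here is a minimal sketch using requests with stream=True and iter_content, stopping after a byte cap rather than after a fixed number of chunks. The URL, output filename, cap, and chunk size are assumptions, not values from the question.

import requests

url = "http://myurlhere.com/"     # placeholder URL
out_path = "myfilename"           # placeholder output file
max_bytes = 1024 ** 3             # arbitrary cap of roughly 1 GB
chunk_size = 1024 * 1024          # 1 MB per chunk; tune as needed

written = 0
with open(out_path, "wb") as out:
    r = requests.get(url, stream=True)   # stream=True avoids reading the whole body into memory
    try:
        for chunk in r.iter_content(chunk_size=chunk_size):
            out.write(chunk)
            written += len(chunk)
            if written >= max_bytes:      # stop once the cap is reached
                break
    finally:
        r.close()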
