I have to read the first few thousand records from a webpage that consists of millions of lines of text data. I also need a copy of this data on my own machine. I don't mind whether that means writing to a text file as I go or downloading the whole thing at once, and I've been trying the latter.
However, the page is so long that I run out of disk space every time I try to retrieve all of the millions of lines (the error below is "No space left on device").
import os, urllib.request

os.chdir('/Users/myusername/onamac')
url = "http://myurlhere.com/"
urllib.request.urlretrieve(url, 'myfilename')
Eventually I get:
Traceback (most recent call last):
  File "<ipython-input-38-0ebf43ee369f>", line 6, in <module>
    urllib.request.urlretrieve(url, 'mytweets')
  File "/anaconda/lib/python3.6/urllib/request.py", line 281, in urlretrieve
    tfp.write(block)
OSError: [Errno 28] No space left on device
The data isn't just separated one record per line, which is a problem; it's basically a series of dictionaries that I'd eventually want to call json.loads on and read into a large table.
Another idea I've had is to somehow stop the urlretrieve request once the file reaches a certain size (I don't really care exactly how many records I get; maybe I'd cap it at 1 GB or so and see if that's enough). But I'm not sure how I'd use tell() or anything else here, since I don't see a way to stop urllib.request.urlretrieve partway through.