
I have checked this other answer that I found on this forum: "In Python, given a URL to a text file, what is the simplest way to read the contents of the text file?"

It was useful, but take a look at my file here: http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt

You'll notice that there is a ton of data in it. So when I use this code:

import urllib2

data = urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt').read(69700)  # read only 69700 chars

data = data.split("\n")  # then split it into lines

for line in data:
    print line

With the headers included, the most I can get Python to read from the file is 69700 characters, but I need all of the data in there, which is roughly 30000000 characters.

When I ask for that many characters, only a chunk of the data shows up, and the headers for the columns in the file are gone. How can I fix this?

user665997
  • The SO answer you reference shows you how to read the url line by line. Considering that you are processing line oriented data, that's most likely the way to go. You may want to just pass the urlopen object to a CSV reader and let it pull the data in. – tdelaney Oct 02 '13 at 17:16
  • "*The amount of characters that python can read with the headers in the URL file is 69700 characters*" - I disagree. Get rid of `.read(69700)` and everything will be fine. – Robᵩ Oct 02 '13 at 17:23
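
The CSV-reader suggestion in the comments above could look roughly like the following in Python 3, where `urllib2.urlopen` lives at `urllib.request.urlopen`. This is only a sketch: the helper names `rows_from_stream` and `rows_from_url` are made up for illustration, and both the comma delimiter and the UTF-8 encoding are assumptions, since the thread does not say how the file is actually laid out.

```python
import csv
import io
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def rows_from_stream(stream, delimiter=",", encoding="utf-8"):
    """Yield parsed rows from a binary file-like object.

    The delimiter and encoding defaults are assumptions; adjust them
    to match the real file.
    """
    # csv.reader needs text, so wrap the binary stream in a decoder.
    text = io.TextIOWrapper(stream, encoding=encoding, newline="")
    for row in csv.reader(text, delimiter=delimiter):
        yield row


def rows_from_url(url, **kwargs):
    """Hypothetical convenience wrapper: stream rows straight from a URL."""
    with urlopen(url) as response:
        for row in rows_from_stream(response, **kwargs):
            yield row
```

Because the rows are produced one at a time, nothing forces the whole download into memory, and the first row yielded is the header line.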

2 Answers


What yer gonna wanna do here is read and process the data in chunks, e.g.:

import urllib2

f = urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt')
while True:
    next_chunk = f.read(4096)  # read the next 4 KB
    if not next_chunk:  # an empty string means all data has been read
        break
    process_chunk(next_chunk)  # placeholder: substitute your own processing function
f.close()
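
One thing to watch with fixed-size chunks is that a chunk boundary can fall in the middle of a line. Here is a rough Python 3 sketch of one way to stitch lines back together; `iter_lines_from_chunks` and `count_remote_lines` are illustrative names, not part of any library, and the code assumes lines end with `\n`.

```python
import io
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def iter_lines_from_chunks(stream, chunk_size=4096):
    """Yield complete lines (bytes, newline removed) from a binary stream
    that is read in fixed-size chunks."""
    pending = b""  # tail of the previous chunk that has no newline yet
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of data
            break
        pending += chunk
        # Everything before the last newline is complete; keep the rest.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line
    if pending:  # final line without a trailing newline
        yield pending


def count_remote_lines(url):
    """Hypothetical usage: count the lines of a remote file chunk by chunk."""
    with urlopen(url) as response:
        return sum(1 for _ in iter_lines_from_chunks(response))


# Small offline demonstration with an in-memory stream.
sample = io.BytesIO(b"header\nrow1\nrow2")
for line in iter_lines_from_chunks(sample, chunk_size=4):
    print(line)
```

If the file uses `\r\n` line endings, each yielded line will still carry a trailing `\r` that needs stripping.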
Claudiu

The simple ways work just fine:

If you want to examine the file line by line:

import urllib2

for line in urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt'):
    # Do something, like maybe print the data.
    # The trailing comma stops print from adding a second newline.
    print line,

Or, if you want to download all of the data:

import sys
import urllib2

data = urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt')
data = data.read()  # no size argument, so this reads to the end of the file
sys.stdout.write(data)
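
For anyone on Python 3, where `urllib2` no longer exists and `urlopen` is found in `urllib.request`, both approaches might be sketched like this. The function names are made up for illustration, and decoding as UTF-8 is an assumption about the file, since the response body arrives as bytes.

```python
import io
import sys
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def copy_lines(stream, out=None, encoding="utf-8"):
    """Line by line: decode each line of a binary stream and write it out.

    Returns the number of lines written. The encoding is an assumption.
    """
    if out is None:
        out = sys.stdout
    count = 0
    for raw_line in stream:  # binary streams iterate one line at a time
        out.write(raw_line.decode(encoding, errors="replace"))
        count += 1
    return count


def read_everything(stream, encoding="utf-8"):
    """All at once: read() with no size argument reads to the end."""
    return stream.read().decode(encoding, errors="replace")


def print_remote_file(url):
    """Hypothetical usage: echo a remote text file line by line."""
    with urlopen(url) as response:
        return copy_lines(response)
```

Calling `print_remote_file` with the URL from the question would echo the file line by line, headers included.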
Robᵩ