import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f.seek(0)
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)

The above is the code I am using to read a csv file. The csv file is only about 800 MB and I am using a 64-bit system with 8 GB of RAM. The file contains 100 million lines. However, even reading just the first 10 million lines, let alone the entire file, gives me a 'MemoryError:' (and that really is the entire error message).

Could someone tell me why, please? Also, as a side question, could someone tell me how to read from, say, the 20 millionth row? I know I need to use f.seek(some number), but since my data is a csv file I don't know exactly which number to put into f.seek() so that it starts reading exactly at the 20 millionth row.

Thank you very much.

martineau
nobody
  • seek only deals with byte offsets and knows nothing about CSV row sizes. If your csv rows are different sizes, you'll have to count through all 20 million lines, since you can't just seek to a spot and assume that it's row X. As for the memory business, are you running a 64-bit Python? A 32-bit Python would have a practical limit of around 3 GB... – Marc B May 22 '15 at 19:14
  • I'm assuming you're using 64-bit python – EdChum May 22 '15 at 19:15
  • This could be a lead to the answer. http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory – LMC May 22 '15 at 19:18
  • @MarcB Thanks for your comment. I realized I am running a 32-bit Python... but even so, it should not be a problem to read 800 MB, right? (And in my case it is only 1/10 of the entire data, so 80 MB.) I see. If f.seek does not work, is there any way to read directly from, say, the 20 millionth row? I've tried islice(f_reader, 20000000, 21000000) but it is significantly slower than reading with islice(f_reader, 0, 1000000), suggesting that it takes a really long time for islice to find the right position – nobody May 22 '15 at 19:24
  • 1
    why don't you use np.loadtxt? – Daniel May 22 '15 at 19:25
  • @LuisMuñoz I am sorry, but I do not think this is what I want. I do not want to process the data line by line; I really need to process it all together – nobody May 22 '15 at 21:28
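
Picking up Daniel's np.loadtxt suggestion from the comments: something along these lines would cover both halves of the question at once, skipping the first 20 million rows and reading the next 10 million straight into a NumPy array. This is only a sketch (untested, like the snippets in the answer below) and assumes a NumPy recent enough to have the max_rows argument; it still has to scan past the skipped rows, so it will not beat the consume() approach below on speed.

#UNTESTED - sketch of Daniel's np.loadtxt suggestion, not from the original question or answer
import numpy as np

raw_data = np.loadtxt(
    "data.csv",
    dtype=int,
    delimiter=",",
    skiprows=20000000,   # lines to discard before reading
    max_rows=10000000,   # lines to read (requires NumPy >= 1.16)
)

On a 32-bit interpreter this can still exhaust the address space for 10 million rows, so Marc B's point about switching to a 64-bit Python applies here as well.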

1 Answer


could someone tell me how to read from, say, the 20 millionth row please? I know I need to use f.seek(some number)

No, you can't (and mustn't) use f.seek() in this situation. Rather, you must read each of the first 20 million rows somehow.

The Python documentation (the itertools recipes) has this recipe:

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

Using that, you would start after 20,000,000 rows thusly:

#UNTESTED
f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)     # skip the first 20,000,000 parsed rows
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)

or perhaps this might go faster, since it consumes the raw file object instead of the reader and so skips CSV parsing for the 20,000,000 discarded lines:

#UNTESTED
f = open("data.csv")
consume(f, 20000000)            # skip 20,000,000 raw lines without CSV parsing
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
Robᵩ
  • Thank you so much. Will go and try this soon – nobody May 22 '15 at 19:27
  • I have tested this. Indeed it was notably faster than using islice directly, which is quite magical as consume essentially uses islice too... However it is still a little slower than f.seek(), though of course f.seek cannot give me an accurate starting location – nobody May 22 '15 at 21:23
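
On the MemoryError half of the question, which the answer above does not address: list(islice(...)) materialises ten million Python lists of strings before NumPy ever sees them, and that intermediate structure is where most of the memory goes, especially on a 32-bit interpreter with its limited address space. A possible lower-memory variant (again only an untested sketch, reusing the consume() recipe from the answer) streams the parsed fields straight into np.fromiter instead:

#UNTESTED - not part of the original answer, just a possible memory-saving variant
import csv
import numpy as np
from itertools import chain, islice

f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)        # consume() as defined in the answer above

rows = islice(f_reader, 10000000)
first = next(rows)                 # peek at one row to learn the column count
n_cols = len(first)

# stream every field into the array instead of building a huge list of lists
fields = (int(x) for row in chain([first], rows) for x in row)
raw_data = np.fromiter(fields, dtype=int).reshape(-1, n_cols)

The final array is the same size either way, so if the data genuinely does not fit in a 32-bit address space this only delays the failure; the advice in the comments to move to a 64-bit Python still stands.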