import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f.seek(0)
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)

The above is the code I am using to read a csv file. The csv file is only about 800 MB and I am using a 64-bit system with 8 GB of RAM. The file contains 100 million lines. However, even reading just the first 10 million lines, let alone the entire file, gives me a 'MemoryError:' (and that really is the entire error message).

Could someone tell me why, please? Also, as a side question, could someone tell me how to read from, say, the 20 millionth row? I know I need to use f.seek(some number), but since my data is a csv file I don't know exactly which number to put into f.seek() so that it starts reading exactly at the 20 millionth row.

Thank you very much.

martineau
nobody
  • seek only deals with byte offsets and knows nothing about CSV row sizes. If your csv rows are different sizes, you'll have to count through all 20 million lines, since you can't just seek to a spot and assume that it's row X. As for the memory business, are you running a 64-bit Python? A 32-bit Python would have a practical limit of around 3 GB... – Marc B May 22 '15 at 19:14
  • I'm assuming you're using 64-bit python – EdChum May 22 '15 at 19:15
  • This could be a lead to the answer. http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory – LMC May 22 '15 at 19:18
  • @MarcB Thanks for your comment. I realized I am running a 32-bit Python... but even so, it should not be a problem to read 800 MB, right? (And in my case it is only 1/10 of the entire data, so 80 MB.) I see. If f.seek does not work, is there any way to read directly from, say, the 20 millionth row? I've tried islice(f_reader, 20000000, 21000000) but it is significantly slower than reading with islice(f_reader, 0, 1000000), suggesting that it takes a really long time for islice to find the right position – nobody May 22 '15 at 19:24
  • 1
    why don't you use np.loadtxt? – Daniel May 22 '15 at 19:25
  • @LuisMuñoz I am sorry, but I do not think this is what I want. I do not want to process the data line by line; I really need to process it all together – nobody May 22 '15 at 21:28
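
Picking up Daniel's np.loadtxt suggestion from the comments: something along these lines would cover both halves of the question at once, skipping the first 20 million rows and reading the next 10 million straight into a NumPy array. This is only a sketch (untested, like the snippets in the answer below) and assumes a NumPy recent enough to have the max_rows argument; it still has to scan past the skipped rows, so it will not beat the consume() approach below on speed.

#UNTESTED - sketch of Daniel's np.loadtxt suggestion, not from the original question or answer
import numpy as np

raw_data = np.loadtxt(
    "data.csv",
    dtype=int,
    delimiter=",",
    skiprows=20000000,   # lines to discard before reading
    max_rows=10000000,   # lines to read (requires NumPy >= 1.16)
)

On a 32-bit interpreter this can still exhaust the address space for 10 million rows, so Marc B's point about switching to a 64-bit Python applies here as well.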

1 Answer


could someone tell me how to read from, say, the 20 millionth row please? I know I need to use f.seek(some number)

No, you can't (and mustn't) use f.seek() in this situation. Rather, you must read each of the first 20 million rows somehow.

The Python documentation (the itertools recipes) has this recipe:

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

Using that, you would start after 20,000,000 rows thusly:

#UNTESTED
f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)     # skip the first 20,000,000 parsed rows
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)

or perhaps this might go faster, since it consumes the raw file object instead of the reader and so skips CSV parsing for the 20,000,000 discarded lines:

#UNTESTED
f = open("data.csv")
consume(f, 20000000)            # skip 20,000,000 raw lines without CSV parsing
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
Robᵩ
  • Thank you so much. Will go and try this soon – nobody May 22 '15 at 19:27
  • I have tested this. Indeed it was notably faster than using islice directly, which is quite magical as consume essentially uses islice too... However it is still a little slower than f.seek(), though of course f.seek cannot give me an accurate starting location – nobody May 22 '15 at 21:23
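
On the MemoryError half of the question, which the answer above does not address: list(islice(...)) materialises ten million Python lists of strings before NumPy ever sees them, and that intermediate structure is where most of the memory goes, especially on a 32-bit interpreter with its limited address space. A possible lower-memory variant (again only an untested sketch, reusing the consume() recipe from the answer) streams the parsed fields straight into np.fromiter instead:

#UNTESTED - not part of the original answer, just a possible memory-saving variant
import csv
import numpy as np
from itertools import chain, islice

f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)        # consume() as defined in the answer above

rows = islice(f_reader, 10000000)
first = next(rows)                 # peek at one row to learn the column count
n_cols = len(first)

# stream every field into the array instead of building a huge list of lists
fields = (int(x) for row in chain([first], rows) for x in row)
raw_data = np.fromiter(fields, dtype=int).reshape(-1, n_cols)

The final array is the same size either way, so if the data genuinely does not fit in a 32-bit address space this only delays the failure; the advice in the comments to move to a 64-bit Python still stands.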