6

I'm trying to process data obtained from a CSV file using the csv module in Python. There are about 50 columns and 401125 rows in it. I used the following code to put that data into a list:

import csv

csv_file_object = csv.reader(open(r'some_path\Train.csv', 'rb'))
header = csv_file_object.next()  # skip the header row
data = []
for row in csv_file_object:
    data.append(row)

I can get the length of this list using len(data), and it returns 401125. I can even get each individual record by indexing into the list. But when I try to get the size of the list by calling np.size(data) (I imported numpy as np), I get the following stack trace.

MemoryError                               Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 np.size(data)

C:\Python27\lib\site-packages\numpy\core\fromnumeric.pyc in size(a, axis)
   2198             return a.size
   2199         except AttributeError:
-> 2200             return asarray(a).size
   2201     else:
   2202         try:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    233
    234     """
--> 235     return array(a, dtype, copy=False, order=order)
    236
    237 def asanyarray(a, dtype=None, order=None):

MemoryError:

I can't even divide that list into multiple parts using list indices, or convert it into a NumPy array. It gives the same memory error.

How can I deal with this kind of big data sample? Is there another way to process large data sets like this one?

I'm using IPython Notebook on Windows 7 Professional.

maheshakya
  • 2,198
  • 7
  • 28
  • 43
  • Can you process the file [line by line as suggested in a related answer](http://stackoverflow.com/a/6853981/1290420)? – daedalus Jan 27 '13 at 19:49
  • Do you need to have all rows in memory at once? Are you just doing one-off processing on each row? An alternative would be to make a collections.NamedTuple and convert each row into one of these tuples. They use up a really minimal amount of memory. – hughdbrown Jan 27 '13 at 19:50
  • The more rows I have in memory at once, the better. I'll try NamedTuple. Thank you – maheshakya Jan 27 '13 at 19:53
  • Or, if you don't need all the fields in the CSV file, only pick out the ones you need rather than adding them all to the `data`? – Simon Jan 27 '13 at 19:55
  • 9
    It looks to me like you're making a Python `list`, not a numpy `ndarray`. So when you call `np.size`, your `list` doesn't have a `.size()` method, and `np.size` falls back on calling `asarray`, and ultimately it makes (at least) one entire other copy of your data in memory. 50x400k is not so big if it's numerical, and if so I'd use `np.loadtxt` instead of going via csv. If your data isn't mostly numerical, there are other solutions anyway. – DSM Jan 27 '13 at 19:56
  • I kind of need all those fields, and in the majority of records the values of those fields are empty, so I don't think the number of fields makes a big difference here. – maheshakya Jan 27 '13 at 19:59
  • Only a few fields are numerical – maheshakya Jan 27 '13 at 20:00
  • What kind of data is it then (if it isn't mainly numerical)? Because assuming all your data records are 64 bits or less, your whole dataset amounts to "only" 160 MB, so it shouldn't be a problem to copy it, even several times. More importantly, your question isn't precise enough to be answered: you didn't tell us what you want to do with the data, which we have to know in order to provide you with a solution. As Dougal suggested, the Pandas library might be a very good fit for your data (again, depending on what you want to do with it) – Félix Cantournet Jan 27 '13 at 20:17
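As a rough illustration of the namedtuple suggestion from the comments above, here is a minimal sketch (it assumes the header entries happen to be valid Python identifiers; each row becomes a lightweight tuple instead of a list):

import csv
from collections import namedtuple

with open(r'some_path\Train.csv', 'rb') as f:
    reader = csv.reader(f)
    header = reader.next()
    Row = namedtuple('Row', header)       # field names taken from the CSV header
    data = [Row(*row) for row in reader]  # each record is a small immutable tuple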

1 Answer

12

As noted by @DSM in the comments, the reason you're getting a memory error is that calling np.size on a list will copy the data into an array first and then get the size.

If you don't need to work with it as a numpy array, just don't call np.size. If you do want numpy-like indexing options and so on, you have a few options.

You could use pandas, which is designed for handling large, not-necessarily-numerical datasets and has some great helpers for doing so.
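For instance, a minimal sketch (assuming the same file path as in the question; read_csv infers a dtype for each column and represents empty fields as NaN):

import pandas as pd

# Read the whole CSV into a DataFrame in one call.
df = pd.read_csv(r'some_path\Train.csv')
print df.shape   # (number of rows, number of columns)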

If you don't want to do that, you could define a NumPy structured array and populate it line by line in the first place, rather than making a list and copying into it. Something like:

import csv
import numpy as np

# Each field needs a name and a dtype; string fields need a fixed width, e.g. 'S50'.
fields = [('name1', 'S50'), ('name2', float), ...]
data = np.zeros((num_rows,), dtype=fields)  # num_rows: known row count (401125 here)

csv_file_object = csv.reader(open(r'some_path\Train.csv', 'rb'))
header = csv_file_object.next()
for i, row in enumerate(csv_file_object):
    data[i] = tuple(row)  # structured-array rows are assigned as tuples

You could also define fields based on header so you don't have to manually type out all 50 column names, though you'd have to do something about specifying the data types for each.
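For instance, a rough sketch reusing csv_file_object, header, and num_rows from the snippet above, and falling back to a generic object dtype for every column since the per-column types aren't known up front:

# Build the dtype from the header; object fields accept any value, at the cost
# of some memory efficiency. Assumes the header names are unique.
fields = [(name, object) for name in header]
data = np.zeros((num_rows,), dtype=fields)

for i, row in enumerate(csv_file_object):
    data[i] = tuple(row)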

Danica
  • 28,423
  • 6
  • 90
  • 122
  • There are about 50 fields and it's not easy to define a specific data type for each of them, so I used 'object' as the dtype so that any type can be inserted (a list of fields in this case). This method did the job. I'll try pandas too. Thank you – maheshakya Jan 28 '13 at 07:15