Numpy/scipy load huge sparse matrix to use in scikit-learn

Question

I have a dataset of 40,000 rows and 5000 columns of boolean values (1s and 0s) in a csv file. I cannot load this into numpy because it throws a MemoryError.

I tried loading it into a sparse matrix as was answered in this question: csv to sparse matrix in python

However, this format cannot be used in scikit-learn. Is there a way to read the csv into a sparse matrix that scikit-learn can actually use?

Loading the matrix directly into numpy is done with:

matrix = np.loadtxt('data.csv', skiprows=1, delimiter=',')

Does the `csv` file have all those 1s and 0s? Lines that are 10000 characters long (with delimiter)? And 40,000 lines? — hpaulj, Oct 07 '15 at 18:41
The dense array would require 800 MB of memory, not an insane amount. If your PC is half decent and you're using 64 bit python this should fit in RAM. However `numpy.loadtxt` and `numpy.genfromtxt` are notoriously memory hungry, maybe `numpy.fromfile` could work? — , Oct 07 '15 at 18:48
@moarningsun, 370 MB on disk actually. I have a good laptop with plenty of RAM (16 GB), but having the 32-bit version of Python might indeed cause the loading issue. Although the dataset wouldn't go over 2 GB, I suppose. — Tim, Oct 07 '15 at 19:01
@Tim - Oops, you're right, I miscalculated. It should only require 200 MB (191 MiB) of RAM (with dtype `bool` or `(u)int8`). How did you try to load it such that it gives the MemoryError? At any rate, if you're going to do machine learning, I think sooner rather than later you'll need the 64-bit Python. — , Oct 07 '15 at 20:19
You mentioned loading into a sparse matrix like in that answer to the question you linked. Is that the exact code that's causing this `MemoryError`? Like the previous comments mention, it would be helpful for us if you could provide a sample of the data you are loading, and the snippet of code you are using to load the data. It very well could be related to the Python/system you're running on, but it may also be related to the way you're loading the data. — rabbit, Oct 07 '15 at 22:13

score 1 · Answer 1 · answered Oct 07 '15 at 14:42

The answer to the question you linked produces a lil_matrix. According to the scipy docs, you can call matrix.tocsr() to convert it into a csr_matrix, which should be usable in sklearn routines that accept sparse matrices. It would be more elegant to read your data directly into a csr_matrix, but for your dataset of boolean values, this should work all right.
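
For reference, here is a minimal sketch of reading the csv directly into a csr_matrix, one row at a time, so the dense array is never built. The function name load_csv_as_csr is hypothetical, and the sketch assumes a single header row and cells that are exactly "0" or "1"; adjust it if your file differs.

```python
import csv

import numpy as np
from scipy.sparse import csr_matrix


def load_csv_as_csr(path, skip_header=True):
    """Build a CSR matrix from a 0/1 csv without a dense intermediate.

    Hypothetical helper: assumes every cell is the string "0" or "1".
    """
    indices = []   # column index of every nonzero cell, row after row
    indptr = [0]   # indptr[i]:indptr[i + 1] slices out row i in `indices`
    n_cols = 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader, None)  # mirrors skiprows=1 in the question
        for row in reader:
            n_cols = max(n_cols, len(row))
            # Store only the positions of the 1s; zeros cost nothing.
            indices.extend(j for j, cell in enumerate(row) if cell.strip() == "1")
            indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.int8)
    return csr_matrix(
        (data, np.asarray(indices, dtype=np.int32), np.asarray(indptr, dtype=np.int32)),
        shape=(len(indptr) - 1, n_cols),
    )
```

Calling X = load_csv_as_csr('data.csv') should then give a matrix that can be passed to any sklearn estimator that accepts sparse input. Memory use scales with the number of 1s in the file rather than with rows times columns, which is the point of skipping the dense load.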
