
How do I use scikit-learn to train a model on a large CSV file (~75 MB) without running into memory problems?

I'm using the IPython notebook as my programming environment, and the pandas + sklearn packages to analyze data from Kaggle's digit recognizer tutorial.

The data is available on the webpage, here is a link to my code, and here is the error message:

KNeighborsClassifier is used for the prediction.

Problem:

A MemoryError occurs when loading the large dataset with the read_csv function. To get around it temporarily, I have to restart the kernel; read_csv then loads the file successfully, but the same error occurs when I run the same cell again.

When read_csv does load the file successfully and I have made my changes to the DataFrame, I can pass the features and labels to KNeighborsClassifier's fit() function. At that point a similar memory error occurs.

I tried the following:

Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that each call to fit() overwrites the model trained on the previous chunk.

What do you think I can do to successfully train my model without running into memory problems?

Ji Park
  • Your code + data runs fine on my laptop. It requires approx 1.2 GB of memory. How much memory does your system have? – Sicco Jul 29 '12 at 10:33
  • Got it working using loadtxt. Even without the memory error occurring, running only ~75 MB of data through the algorithm takes up more than 1 GB of RAM... I'm not sure if I'm doing anything wrong in my code. (http://pastie.org/4354911) (IPython notebook). If it's just the algorithm that takes this long, how do you load gigabytes of data into the algorithm without taking so long to create a model? – Ji Park Jul 29 '12 at 20:44
  • You could use an algorithm that can be trained incrementally, thereby processing only (small) parts of the data at a time. An estimator in scikit-learn is capable of doing this if it implements the `partial_fit` method. – Sicco Jul 30 '12 at 09:46
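
Here is a minimal sketch of the incremental approach described in that comment, assuming the Kaggle train.csv layout (a header row, a `label` column, then pixel columns) and using `SGDClassifier` as an estimator that supports `partial_fit`; the file name and column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Each chunk updates the same model via partial_fit instead of
# overwriting it with fit(). "train.csv" and the "label" column
# name are assumptions about the Kaggle digit recognizer data.
clf = SGDClassifier()
classes = np.arange(10)  # all possible digit labels, required on the first call

for chunk in pd.read_csv("train.csv", chunksize=5000):
    y = chunk["label"].values
    X = chunk.drop("label", axis=1).values.astype(np.float32)
    clf.partial_fit(X, y, classes=classes)
```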

1 Answer


Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two different columns can have distinct datatypes (e.g. integer, dates, strings).

When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
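
As a rough illustration of limiting that duplication, one option is to build the float array yourself and release the DataFrame before fitting; this is only a sketch, and the file and column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("train.csv")                            # file name is an assumption
y = df["label"].values
X = df.drop("label", axis=1).values.astype(np.float32)   # one explicit float32 copy
del df                                                   # release the DataFrame before fitting

clf = KNeighborsClassifier()
clf.fit(X, y)
```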

To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
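
A minimal sketch of that approach with numpy.loadtxt, assuming the Kaggle file has a header row, comma separators, and the label in the first column:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Parse the CSV straight into a float32 array, skipping the header row.
# The file name and label-in-first-column layout are assumptions.
data = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = data[:, 0]    # first column: digit label
X = data[:, 1:]   # remaining columns: pixel values

clf = KNeighborsClassifier()
clf.fit(X, y)
```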

Also, if your data is very sparse (many zero values) it is better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings). However, the CSV format itself is not very well suited for sparse data and I am not sure there exists a direct CSV-to-scipy.sparse parser.
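
If the data really is mostly zeros, a rough sketch of this idea is to convert the parsed array to a scipy.sparse matrix and fit an estimator that accepts sparse input; note that the CSV still has to be parsed into a dense array first, which is the limitation mentioned above, and the file layout is an assumption:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.naive_bayes import MultinomialNB

# The dense array is still needed while parsing, but it can be dropped
# once the features are stored as CSR. MultinomialNB accepts sparse input;
# check the docstrings of other estimators before relying on this.
dense = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = dense[:, 0]
X = sp.csr_matrix(dense[:, 1:])   # compressed sparse row feature matrix
del dense

clf = MultinomialNB()
clf.fit(X, y)
```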

Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:

https://github.com/scikit-learn/scikit-learn/issues/325
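
Until that is addressed, one workaround (a sketch, assuming an already-fitted classifier) is to predict in small batches so the temporary distance matrix stays small:

```python
import numpy as np

# Predict in batches so the temporary distance matrix is at most
# (batch_size, n_samples_train) instead of (n_samples_test, n_samples_train).
# `clf` is assumed to be an already fitted KNeighborsClassifier and
# `X_test` a 2D numpy array of test samples.
def predict_in_batches(clf, X_test, batch_size=1000):
    parts = [clf.predict(X_test[i:i + batch_size])
             for i in range(0, len(X_test), batch_size)]
    return np.concatenate(parts)
```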

ogrisel
  • The scikit-learn model isn't causing any memory exception either. The only problem now is... since the data is so large, the algorithm takes a very long time to create a model. I wish there were a way to make this much faster... – Ji Park Jul 29 '12 at 20:23
  • You should try to use `KNeighborsClassifier` in brute-force mode (instead of ball tree), but then prediction times can be too slow. Alternatively you can use simple models such as `sklearn.linear_model.Perceptron`, `sklearn.naive_bayes.MultinomialNB` or `sklearn.neighbors.NearestCentroid`. Finally, you can also try to train a model on a small subsample of your data to get a first quick idea of the predictive accuracy, then double the size of the dataset and iterate. – ogrisel Jul 30 '12 at 07:27
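
A rough sketch of the subsampling idea from that comment, assuming `X` and `y` are already loaded as numpy arrays and using `Perceptron` as the quick baseline model:

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Fit a cheap linear model on a random subsample first to get a quick
# idea of the achievable accuracy, then grow the subsample and repeat.
# X and y are assumed to be numpy arrays already in memory.
rng = np.random.RandomState(0)
idx = rng.permutation(len(X))[:5000]   # 5000-sample subsample (arbitrary size)

clf = Perceptron()
clf.fit(X[idx], y[idx])
print(clf.score(X[idx], y[idx]))       # training accuracy on the subsample
```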