
How do I use scikit-learn to train a model on a large CSV file (~75 MB) without running into memory problems?

I'm using the IPython notebook as my programming environment, and the pandas + sklearn packages to analyze data from Kaggle's digit recognizer tutorial.

The data is available on the webpage, here is a link to my code, and here is the error message:

KNeighborsClassifier is used for the prediction.

Problem:

A MemoryError occurs when loading the large dataset with the read_csv function. To get around it temporarily, I have to restart the kernel; read_csv then loads the file successfully, but the same error occurs when I run the same cell again.

When read_csv does load the file successfully and I have made my changes to the DataFrame, I can pass the features and labels to KNeighborsClassifier's fit() function. At that point a similar memory error occurs.

I tried the following:

Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that each call to fit() overwrites the model trained on the previous chunk.

What do you think I can do to successfully train my model without running into memory problems?

Ji Park
  • Your code + data runs fine on my laptop. It requires approx 1.2 GB of memory. How much memory does your system have? – Sicco Jul 29 '12 at 10:33
  • Got it working using loadtxt. Even without the memory error occurring, running only ~75 MB of data through the algorithm takes up more than 1 GB of RAM... I'm not sure if I'm doing anything wrong in my code. (http://pastie.org/4354911) (IPython notebook). If it's just the algorithm that takes this long, how do you load gigabytes of data into the algorithm without taking so long to create a model? – Ji Park Jul 29 '12 at 20:44
  • You could use an algorithm that can be trained incrementally, thereby processing only (small) parts of the data at a time. An estimator in scikit-learn is capable of doing this if it implements the `partial_fit` method. – Sicco Jul 30 '12 at 09:46
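
Here is a minimal sketch of the incremental approach described in that comment, assuming the Kaggle train.csv layout (a header row, a `label` column, then pixel columns) and using `SGDClassifier` as an estimator that supports `partial_fit`; the file name and column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Each chunk updates the same model via partial_fit instead of
# overwriting it with fit(). "train.csv" and the "label" column
# name are assumptions about the Kaggle digit recognizer data.
clf = SGDClassifier()
classes = np.arange(10)  # all possible digit labels, required on the first call

for chunk in pd.read_csv("train.csv", chunksize=5000):
    y = chunk["label"].values
    X = chunk.drop("label", axis=1).values.astype(np.float32)
    clf.partial_fit(X, y, classes=classes)
```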

1 Answer


Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two different columns can have distinct datatypes (e.g. integer, dates, strings).

When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
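
As a rough illustration of limiting that duplication, one option is to build the float array yourself and release the DataFrame before fitting; this is only a sketch, and the file and column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("train.csv")                            # file name is an assumption
y = df["label"].values
X = df.drop("label", axis=1).values.astype(np.float32)   # one explicit float32 copy
del df                                                   # release the DataFrame before fitting

clf = KNeighborsClassifier()
clf.fit(X, y)
```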

To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
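
A minimal sketch of that approach with numpy.loadtxt, assuming the Kaggle file has a header row, comma separators, and the label in the first column:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Parse the CSV straight into a float32 array, skipping the header row.
# The file name and label-in-first-column layout are assumptions.
data = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = data[:, 0]    # first column: digit label
X = data[:, 1:]   # remaining columns: pixel values

clf = KNeighborsClassifier()
clf.fit(X, y)
```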

Also, if your data is very sparse (many zero values) it is better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings). However, the CSV format itself is not very well suited for sparse data and I am not sure there exists a direct CSV-to-scipy.sparse parser.
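
If the data really is mostly zeros, a rough sketch of this idea is to convert the parsed array to a scipy.sparse matrix and fit an estimator that accepts sparse input; note that the CSV still has to be parsed into a dense array first, which is the limitation mentioned above, and the file layout is an assumption:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.naive_bayes import MultinomialNB

# The dense array is still needed while parsing, but it can be dropped
# once the features are stored as CSR. MultinomialNB accepts sparse input;
# check the docstrings of other estimators before relying on this.
dense = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = dense[:, 0]
X = sp.csr_matrix(dense[:, 1:])   # compressed sparse row feature matrix
del dense

clf = MultinomialNB()
clf.fit(X, y)
```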

Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:

https://github.com/scikit-learn/scikit-learn/issues/325
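
Until that is addressed, one workaround (a sketch, assuming an already-fitted classifier) is to predict in small batches so the temporary distance matrix stays small:

```python
import numpy as np

# Predict in batches so the temporary distance matrix is at most
# (batch_size, n_samples_train) instead of (n_samples_test, n_samples_train).
# `clf` is assumed to be an already fitted KNeighborsClassifier and
# `X_test` a 2D numpy array of test samples.
def predict_in_batches(clf, X_test, batch_size=1000):
    parts = [clf.predict(X_test[i:i + batch_size])
             for i in range(0, len(X_test), batch_size)]
    return np.concatenate(parts)
```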

ogrisel
  • The scikit-learn model isn't causing any memory exception either. The only problem now is... since the data is so large, the algorithm takes a very long time to create a model. I wish there were a way to make this much faster... – Ji Park Jul 29 '12 at 20:23
  • You should try to use `KNeighborsClassifier` in brute-force mode (instead of ball tree), but then prediction times can be too slow. Alternatively you can use simple models such as `sklearn.linear_model.Perceptron`, `sklearn.naive_bayes.MultinomialNB` or `sklearn.neighbors.NearestCentroid`. Finally, you can also try to train a model on a small subsample of your data to get a first quick idea of the predictive accuracy, then double the size of the dataset and iterate. – ogrisel Jul 30 '12 at 07:27
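
A rough sketch of the subsampling idea from that comment, assuming `X` and `y` are already loaded as numpy arrays and using `Perceptron` as the quick baseline model:

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Fit a cheap linear model on a random subsample first to get a quick
# idea of the achievable accuracy, then grow the subsample and repeat.
# X and y are assumed to be numpy arrays already in memory.
rng = np.random.RandomState(0)
idx = rng.permutation(len(X))[:5000]   # 5000-sample subsample (arbitrary size)

clf = Perceptron()
clf.fit(X[idx], y[idx])
print(clf.score(X[idx], y[idx]))       # training accuracy on the subsample
```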