How do I use scikit-learn to train a model on a large CSV file (~75MB) without running into memory problems?
I'm using IPython Notebook as the programming environment, with the pandas and scikit-learn packages, to analyze the data from Kaggle's digit recognizer tutorial.
The data is available on the Kaggle page, and a link to my code is included; the error is described below. KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv function. To bypass this problem temporarily, I have to restart the kernel, which then read_csv function successfully loads the file, but the same error occurs when I run the same cell again.
When read_csv does load the file successfully and I make my changes to the DataFrame, I can pass the features and labels to KNeighborsClassifier's fit() function. At that point a similar MemoryError occurs.
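For reference, my flow is roughly the following (a minimal sketch; the file name train.csv, the "label" column, and n_neighbors are based on the Kaggle digit recognizer setup, not my exact code):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Load the full ~75MB training CSV; this is where the first MemoryError appears
train = pd.read_csv("train.csv")

# The digit recognizer CSV has a "label" column plus 784 pixel columns
labels = train["label"].values
features = train.drop("label", axis=1).values

# Fitting the classifier on the full feature matrix raises a similar MemoryError
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(features, labels)
```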
I tried the following:
Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that the predictive model is overwritten on every chunk, so only the last chunk's data ends up in the model (see the sketch below).
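What I mean by reading in chunks is roughly this (a sketch; the chunk size and column name are placeholders):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

# Read the CSV in chunks of 5000 rows to keep memory usage low
for chunk in pd.read_csv("train.csv", chunksize=5000):
    labels = chunk["label"].values
    features = chunk.drop("label", axis=1).values
    # Each call to fit() discards whatever was learned from the previous chunk,
    # so the final model only reflects the last chunk of data
    knn.fit(features, labels)
```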
What do you think I can do to successfully train my model without running into memory problems?