sklearn and large datasets

Question

I have a 22 GB dataset that I would like to process on my laptop. Of course, I can't load it into memory.

I use sklearn a lot, but for much smaller datasets.

In this situation, the classical approach should be something like this:

Read part of the data -> partially train your estimator -> delete the data -> read the next part of the data -> continue training your estimator.

I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator on successive subsamples of the data.

Now I am wondering: is there an easy way to do that in sklearn? I am looking for something like

r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # read the next 20 lines
    m.partial_fit(x)

m.predict(new_x)

Maybe sklearn is not the right tool for this kind of thing? Let me know.

See related: http://stackoverflow.com/questions/17017878/is-scikit-learn-suitable-for-big-data-tasks (depending on your task, it should be possible). – EdChum, May 27 '14 at 07:34
I have found some examples for situations with too many variables. But what if we have too many samples? – Donbeo, May 27 '14 at 15:56
I'm not an expert, but I would think it shouldn't matter: your model is trained on the inputs, and only the params/weights should be stored. This is different if you have a decision tree, as it would grow as you increase the number of params and probably the sample size. – EdChum, May 27 '14 at 16:06
The real problem is that I cannot load the CSV file because it is too large. – Donbeo, May 28 '14 at 18:00

score 18 · Answer 1 · answered Oct 31 '15 at 20:26

I've used several scikit-learn classifiers with out-of-core capabilities to train linear models (Stochastic Gradient Descent, Perceptron and Passive Aggressive) and also Multinomial Naive Bayes on a Kaggle dataset of over 30 GB. All these classifiers share the partial_fit method that you mention. Some behave better than others, though.
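To make the partial_fit workflow concrete, here is a minimal sketch rather than the exact code from that case study. It assumes a CSV with a header row, numeric feature columns and one label column; the function name train_out_of_core, the label_col argument and the chunk size are made up for illustration. pandas.read_csv with chunksize and SGDClassifier.partial_fit are real calls.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier


def train_out_of_core(csv_path, label_col, classes, chunksize=10_000):
    """Fit an SGDClassifier incrementally, one CSV chunk at a time.

    Hypothetical helper: assumes numeric features and a label column.
    """
    model = SGDClassifier(random_state=0)
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        y = chunk[label_col].to_numpy()
        X = chunk.drop(columns=[label_col]).to_numpy()
        # partial_fit needs the full list of class labels, because a
        # single chunk may not contain every class.
        model.partial_fit(X, y, classes=classes)
    return model


if __name__ == "__main__":
    # A tiny synthetic file stands in for the real multi-GB CSV.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    demo = pd.DataFrame(X, columns=["a", "b", "c"])
    demo["label"] = (X[:, 0] + X[:, 1] > 0).astype(int)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.csv")
        demo.to_csv(path, index=False)
        model = train_out_of_core(path, "label", np.array([0, 1]), chunksize=100)
    print(model.predict(X[:5]))
```

The same loop should work with the other estimators named above, since they all expose partial_fit; a single pass over the file is usually only a starting point, and shuffling or several passes may be needed.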

You can find the methodology, the case study and some good resources in this post: http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/

score 17 · Accepted Answer · edited May 23 '17 at 12:10

I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.

This link may be useful: "Working with big data in python and numpy, not enough ram, how to save partial results on disc?"

I agree that h5py is useful but you may wish to use tools that are already in your quiver.

Another thing you can do is randomly pick whether or not to keep each row of your CSV file, and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that lets you start experimenting with any algorithm, and you can deal with the bigger-data issue along the way (or not at all: sometimes a sample with a good approach is good enough, depending on what you want).
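A rough sketch of that sampling idea, assuming every column in the CSV is numeric and the kept sample fits in memory. The function name sample_csv_to_npy and the keep_fraction parameter are invented for this example; the pandas and NumPy calls are real.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def sample_csv_to_npy(csv_path, npy_path, keep_fraction=0.01,
                      chunksize=100_000, seed=0):
    """Keep each row with probability keep_fraction; save the sample as .npy.

    Hypothetical helper: assumes an all-numeric CSV with a header row.
    """
    rng = np.random.default_rng(seed)
    kept = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # One independent coin flip per row, so the full file is never
        # loaded at once.
        mask = rng.random(len(chunk)) < keep_fraction
        kept.append(chunk.to_numpy()[mask])
    sample = np.concatenate(kept)
    np.save(npy_path, sample)
    return sample.shape


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "big.csv")
        npy_path = os.path.join(tmp, "sample.npy")
        data = np.random.default_rng(1).normal(size=(5000, 4))
        pd.DataFrame(data).to_csv(csv_path, index=False)
        print(sample_csv_to_npy(csv_path, npy_path, keep_fraction=0.1,
                                chunksize=1000))
        print(np.load(npy_path).shape)
```

Fixing the seed makes the sample reproducible, which helps when comparing algorithms on the same subset.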

score 3 · Answer 3 · answered Nov 28 '16 at 21:31

You may want to take a look at Dask or GraphLab.

They are similar to pandas but work on large-scale data (using out-of-core dataframes). The problem with pandas is that all data has to fit into memory.

Both frameworks can be used with scikit-learn. You can load 22 GB of data into Dask or SFrame, then use it with sklearn.

answered by Tuan Vu

So does it work with scikit-learn? Or not? Please extend your answer – Mayou36 Jun 08 '17 at 19:47
@Mayou36 I have used SFrames with scikit learn and yes they are very much compatible. I have not used Dask though. – frank Oct 01 '17 at 19:23
I don't believe that scikit-learn will accept a dask dataframe as input – alex Oct 20 '20 at 14:05

score 0 · Answer 4 · answered Jun 09 '14 at 13:12

I find it interesting that you have chosen to use Python for statistical analysis rather than R. However, I would start by putting my data into a format that can handle such large datasets. The Python h5py package is fantastic for this kind of storage, allowing very fast access to your data. You will need to chunk up your data in reasonable sizes, say 1 million elements per chunk (e.g. 20 columns x 50,000 rows), writing each chunk to the H5 file. Next you need to think about what kind of model you are running, which you haven't really specified.
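As a sketch of the chunked H5 storage described above, assuming h5py is installed and the CSV is entirely numeric. The function names csv_to_hdf5 and iter_hdf5_blocks and the dataset name "data" are my own choices for illustration, not anything prescribed by h5py.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def csv_to_hdf5(csv_path, h5_path, chunk_rows=50_000):
    """Copy a numeric CSV into one resizable HDF5 dataset, chunk by chunk."""
    with h5py.File(h5_path, "w") as f:
        dset = None
        for chunk in pd.read_csv(csv_path, chunksize=chunk_rows):
            block = chunk.to_numpy(dtype="float64")
            if dset is None:
                # maxshape=(None, n_cols) lets the dataset grow along rows;
                # chunks=True leaves the on-disk chunk layout to h5py.
                dset = f.create_dataset(
                    "data",
                    shape=(0, block.shape[1]),
                    maxshape=(None, block.shape[1]),
                    chunks=True,
                    dtype="float64",
                )
            start = dset.shape[0]
            dset.resize(start + block.shape[0], axis=0)
            dset[start:] = block
    return h5_path


def iter_hdf5_blocks(h5_path, chunk_rows=50_000):
    """Yield consecutive row blocks without loading the whole dataset."""
    with h5py.File(h5_path, "r") as f:
        dset = f["data"]
        for start in range(0, dset.shape[0], chunk_rows):
            # Slicing an h5py dataset reads only that slice from disk.
            yield dset[start:start + chunk_rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "demo.csv")
        h5_path = os.path.join(tmp, "demo.h5")
        data = np.random.default_rng(0).normal(size=(250, 20))
        pd.DataFrame(data).to_csv(csv_path, index=False)
        csv_to_hdf5(csv_path, h5_path, chunk_rows=100)
        print([b.shape for b in iter_hdf5_blocks(h5_path, chunk_rows=100)])
```

The blocks yielded by iter_hdf5_blocks are plain NumPy arrays, so they can be passed to whatever model code comes next.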

The fact is that you will probably have to write the algorithm for the model and the machine-learning cross-validation yourself, because the data is large. Start by writing an algorithm to summarize the data, so that you know what you are looking at. Then, once you decide what model you want to run, you will need to think about what the cross-validation will be. Add a "column" to each chunk of the data set that denotes which validation set each row belongs to. You may choose to assign each whole chunk to a particular validation set.
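The validation "column" could look something like the sketch below. It assumes the chunks arrive as pandas DataFrames (for example from pd.read_csv with chunksize); the helper names add_fold_column and split_on_fold and the column name "fold" are hypothetical.

```python
import numpy as np
import pandas as pd


def add_fold_column(chunks, n_folds=5, seed=0):
    """Yield each chunk with a 'fold' column holding values 0..n_folds-1."""
    rng = np.random.default_rng(seed)
    for chunk in chunks:
        chunk = chunk.copy()
        # Row-level assignment. To label a whole chunk with one validation
        # set instead, draw a single rng.integers(n_folds) per chunk.
        chunk["fold"] = rng.integers(0, n_folds, size=len(chunk))
        yield chunk


def split_on_fold(chunk, held_out):
    """Return (training rows, validation rows) of one labelled chunk."""
    is_valid = chunk["fold"] == held_out
    return chunk[~is_valid], chunk[is_valid]


if __name__ == "__main__":
    df = pd.DataFrame({"x": np.arange(10.0), "y": np.arange(10) % 2})
    # Two in-memory pieces stand in for pd.read_csv(..., chunksize=...).
    for labelled in add_fold_column([df.iloc[:5], df.iloc[5:]], n_folds=3):
        train, valid = split_on_fold(labelled, held_out=0)
        print(len(train), len(valid))
```

Storing the fold label with the data means every later pass over the chunks sees the same split.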

Next you will need to write a map-reduce style algorithm to run your model on the validation subsets. The alternative is simply to run models on each chunk of each validation set and average the results (consider the theoretical validity of this approach).

Consider using Spark, or R and rhdf5, or something similar. I haven't supplied any code because this is a project rather than just a simple coding question.

Using Python for data analysis instead of R is quite common. AFAIK, they are about equally used nowadays, and Python, as it is a fully functional programming language, is often preferred by users with some programming experience. – Mayou36, Jun 08 '17 at 19:45
