
I have a CSV file of 10+ GB. I used the chunksize parameter available in pandas.read_csv() to read and pre-process the data, and for training the model I want to use one of the online learning algorithms.
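
Roughly, the reading loop looks like this (the file name, chunk size, and per-chunk cleaning step are placeholders):

```python
import pandas as pd

# Read the 10+ GB file in pieces instead of loading it all at once.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    chunk = chunk.dropna()  # stand-in for the real per-chunk pre-processing
    # ... the processed chunk is then handed to the model
```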

Normally, cross-validation and hyper-parameter tuning are done on the entire training data set, and the model is then trained with the best hyper-parameters. But with data this huge, if I do the same on a chunk of the training data, how do I choose the hyper-parameters?

ashwin g

1 Answer


I believe you are looking for online learning algorithms like the ones mentioned at this link: Scaling Strategies for large datasets. You should use algorithms that support the partial_fit method, so that these large datasets can be fed in chunks. You can also look at the following links to see which one helps you best, since you haven't specified the exact problem or the algorithm that you are working on:
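
As a rough illustration of that partial_fit pattern (the file path, column name, chunk size, and class labels below are placeholders, not taken from your data), an SGDClassifier can be updated one chunk at a time:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

CSV_PATH = "train.csv"            # placeholder path to the 10+ GB file
LABEL_COL = "target"              # placeholder name of the label column
ALL_CLASSES = np.array([0, 1])    # every label the model will ever see

clf = SGDClassifier(random_state=0)

for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    y = chunk[LABEL_COL].values
    X = chunk.drop(LABEL_COL, axis=1).values
    # classes= is required on the first partial_fit call so the model
    # knows the full label set, even if a chunk is missing some classes.
    clf.partial_fit(X, y, classes=ALL_CLASSES)
```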

EDIT: If you want to solve the class imbalance problem, you can try the imbalanced-learn library in Python.
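
For instance, a minimal sketch of random over-sampling with imbalanced-learn (the data here is synthetic, just to show the API; older releases name the method fit_sample instead of fit_resample):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Synthetic, deliberately imbalanced data purely to illustrate the call.
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                      # roughly 900 of class 0 and 100 of class 1

ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)  # fit_sample() in older imbalanced-learn versions
print(Counter(y_res))                  # both classes now equally represented
```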

Gambit1614
  • I wanted to use MLPClassifier or SGD. My problem is how to tune the hyper-parameters in an online learning algorithm; I cannot use grid search for each and every chunk, right? – ashwin g Sep 26 '17 at 06:31
  • @ashwing GridSearchCV and the online implementation of SGD will take care of that itself; you don't need to tune each and every parameter manually. Read more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html – Gambit1614 Sep 26 '17 at 06:54
  • When I am dummifying each chunk of data over the iterations and training the model, a few levels of the categorical variables are missing because they only appear later in the file. Is there any solution for this? – ashwin g Sep 28 '17 at 06:10
  • @ashwing you can try this: https://stackoverflow.com/a/35411686/8160718 (see the sketch below) – Gambit1614 Sep 28 '17 at 06:30
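
One way to get consistent dummy columns across chunks, in the spirit of that last suggestion (the column name and level list here are placeholders, and this is a sketch rather than a copy of the linked answer), is to fix the full set of levels before calling get_dummies, so every chunk produces the same columns:

```python
import pandas as pd

# Placeholder: the complete level list for the column, e.g. gathered in one
# cheap pass over just that column of the file.
ALL_LEVELS = ["red", "green", "blue"]

def encode_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # A fixed-category categorical makes get_dummies emit a column for every
    # level, including those absent from this particular chunk.
    chunk["color"] = pd.Categorical(chunk["color"], categories=ALL_LEVELS)
    return pd.get_dummies(chunk, columns=["color"])

demo = pd.DataFrame({"color": ["red", "green"], "x": [1, 2]})
print(encode_chunk(demo).columns.tolist())
# ['x', 'color_red', 'color_green', 'color_blue'], even though "blue" never appears
```

Going back to the grid-search comments above: one common pattern (again only a sketch, with placeholder path, column name, and parameter grid; it is not taken from the original thread) is to tune SGDClassifier with GridSearchCV on a sample that fits in memory, then stream the full file with partial_fit using the winning hyper-parameters, as in the sketch in the answer:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

CSV_PATH = "train.csv"        # placeholder path
LABEL_COL = "target"          # placeholder label column

# Tune on one manageable sample read from the start of the file.
sample = pd.read_csv(CSV_PATH, nrows=200_000)
y_sample = sample[LABEL_COL].values
X_sample = sample.drop(LABEL_COL, axis=1).values

param_grid = {"alpha": [1e-5, 1e-4, 1e-3], "penalty": ["l2", "l1"]}
search = GridSearchCV(SGDClassifier(), param_grid, cv=3)
search.fit(X_sample, y_sample)
print(search.best_params_)

# Then stream the whole file with partial_fit (as in the answer's sketch),
# building the final model from the winning hyper-parameters.
clf = SGDClassifier(**search.best_params_)
```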