
I have a large dataset with 50k rows and 10k columns, and I am trying to fit it using the classifiers in auto-sklearn. Because of limited resources, I have partitioned the data into batches and intend to use incremental learning. Is it possible to call autosklearn.classification.AutoSklearnClassifier.fit() on the first batch and then autosklearn.classification.AutoSklearnClassifier.refit() on each of the remaining batches? The API documentation says:

refit(X, y)

Refit all models found with fit to new data. Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset, but does not keep any trained model and can therefore not be used to predict for new data points. This method fits all models found during a call to fit on the data given. This method may also be used together with holdout to avoid only using 66% of the training data to fit the final model.

Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]. The training input samples.
y : array-like, shape = [n_samples] or [n_samples, n_outputs]. The targets.

Does this mean refit is valid only when cross-validation was used on the original data, or does the first line mean that subsequent batches of data can be used to retrain the same model?

Any ideas/thoughts?

piman314
Anand

1 Answer


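refit is only used to fit the estimators found by fit on a training set after cross-validation has been performed; it retrains them on the data you pass in rather than updating them batch by batch. The method you are after is partial_fit, which scikit-learn provides on some estimators. For example, you can use it with an SGDRegressor; docs are here.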
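As a rough illustration of the partial_fit pattern for binary classification, here is a minimal sketch that uses plain scikit-learn, not auto-sklearn. It assumes SGDClassifier as the incremental model (the classification counterpart of SGDRegressor); the helper names iter_batches and fit_incrementally, the batch size, and the synthetic data are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_batches(X, y, batch_size):
    """Yield consecutive (X_batch, y_batch) slices.

    Hypothetical helper: in practice each batch would be loaded from disk.
    """
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


def fit_incrementally(X, y, batch_size=100, seed=0):
    """Train one linear classifier by calling partial_fit once per batch."""
    clf = SGDClassifier(random_state=seed)
    # partial_fit needs the full set of labels on its first call, because a
    # single batch is not guaranteed to contain every class.
    classes = np.array([0, 1])
    for X_batch, y_batch in iter_batches(X, y, batch_size):
        clf.partial_fit(X_batch, y_batch, classes=classes)
    return clf


# Synthetic, linearly separable binary data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = fit_incrementally(X, y)
accuracy = clf.score(X, y)  # training accuracy after one pass over the batches
```

Each partial_fit call updates the same model in place, so only one batch has to be in memory at a time. This gives a single incrementally trained model; it does not reproduce auto-sklearn's model search or ensembling.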

piman314
  • Thanks, but I am looking for a partial_fit function for binary classification in auto-sklearn. I was unable to find one myself. – Anand Oct 01 '18 at 18:31