1

I have a large file of training data that I read in chunks. I am worried about what happens when I use this code:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
for chunk in reader:  # reader yields one chunk of rows at a time
    clf.fit(chunk, target)

Will clf end up with a model trained on all chunks, or only on the current one? For incremental learning, should I use only classifiers that have a partial_fit() method? And how should I normalize the training data in that setting (building the normalizer from the whole dataset rather than only from the current chunk)?

LinearLeopard
  • In scikit-learn, a new call to `fit()` forgets the previous `fit()`, so the model reflects only the last `fit()`. You need `partial_fit()`. http://scikit-learn.org/stable/modules/scaling_strategies.html#incremental-learning – Vivek Kumar Jun 23 '17 at 05:56
  • @VivekKumar Does this apply to the tfidf vectorizer class too? – cs95 Jun 23 '17 at 05:59
  • @Coldspeed No. If the data is that large, it's [advised by the sklearn authors](https://stackoverflow.com/a/17536682/3374996) to use HashingVectorizer instead. – Vivek Kumar Jun 23 '17 at 06:03
  • It is not such an easy question. There is a warm_start parameter in the constructor, but when you fit a new chunk of data, "UserWarning: Warm-start fitting without increasing n_estimators does not fit new trees" is raised. It is also possible to fit separate forests and then append their estimators (https://stackoverflow.com/questions/28489667/combining-random-forest-models-in-scikit-learn), but I'm not sure that will work well. Has anybody tried it? – LinearLeopard Jun 23 '17 at 18:32

3 Answers

1

Yes, for incremental learning you can only use classifiers which implement partial_fit.
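As a minimal sketch of what that looks like in practice, the example below trains an SGDClassifier (one of the scikit-learn estimators that implements partial_fit) chunk by chunk. The synthetic data and the iter_chunks helper are assumptions made for illustration; they stand in for whatever reader yields chunks from the real file.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a large file: 1,000 rows, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)


def iter_chunks(X, y, chunk_size):
    """Hypothetical reader: yield (features, labels) one chunk at a time."""
    for start in range(0, len(X), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]


clf = SGDClassifier(random_state=0)
all_classes = np.unique(y)  # assumption: the full set of labels is known up front

for X_chunk, y_chunk in iter_chunks(X, y, chunk_size=100):
    # `classes` is required on the first partial_fit call, because a single
    # chunk is not guaranteed to contain every label.
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)

train_accuracy = clf.score(X, y)
```

Unlike repeated calls to fit(), each partial_fit() call updates the existing model instead of replacing it, so the final classifier has seen every chunk.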

StandardScaler has a partial_fit method, so it can be fitted online. I'm not sure that's the right way to do it, though, as the transformation will change over time. If you don't expect the data distribution to change much, you can also fit any scaler on a subset of the data and reuse it later.
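One way around the changing transformation, if reading the data twice is acceptable, is to make a first pass that only accumulates the scaler statistics and a second pass that transforms each chunk with the final values. The sketch below assumes that two-pass setup and uses synthetic data split with np.array_split in place of a real chunked reader.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the full dataset, split into 10 chunks.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
chunks = np.array_split(X, 10)

# First pass: accumulate the running mean and variance over all chunks.
scaler = StandardScaler()
for chunk in chunks:
    scaler.partial_fit(chunk)

# Second pass: transform every chunk with the final, fixed statistics.
scaled_chunks = [scaler.transform(chunk) for chunk in chunks]

# For comparison only: a scaler fitted on the whole array at once.
full_scaler = StandardScaler().fit(X)
```

After the first pass, the incrementally fitted scaler holds the same mean and scale as one fitted on the whole dataset, up to floating-point error, so every chunk is normalized consistently.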

Also note that RandomForestClassifier (like all tree-based classifiers) is scale invariant, so it is not clear that standardization has any effect on it.

Mikhail Korobov
1

RandomForestClassifier does not implement partial_fit(), the method that supports incremental learning on chunks of data.

However, you can combine separately trained RandomForestClassifier models, as mentioned here, using their estimators_ and n_estimators attributes.
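A rough sketch of that combination follows. The combine_forests helper is a hypothetical name, not part of scikit-learn, and the approach assumes every chunk has the same features and contains every class; otherwise the trees disagree on the label set. Treat it as an approximation: each tree has still seen only one chunk.

```python
import copy

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the full dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)


def combine_forests(forests):
    """Hypothetical helper: merge the trees of several fitted forests."""
    combined = copy.deepcopy(forests[0])  # leave the first forest untouched
    for forest in forests[1:]:
        combined.estimators_ = combined.estimators_ + forest.estimators_
    # Keep n_estimators in sync with the actual number of trees.
    combined.n_estimators = len(combined.estimators_)
    return combined


# Fit one small forest per chunk.
forests = []
for X_chunk, y_chunk in zip(np.array_split(X, 3), np.array_split(y, 3)):
    forest = RandomForestClassifier(n_estimators=10, random_state=0)
    forests.append(forest.fit(X_chunk, y_chunk))

combined = combine_forests(forests)
predictions = combined.predict(X)
```

The combined model averages the votes of all 30 trees at prediction time.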

tdy
0

Yes, this will only work for classifiers with partial_fit. Depending on how you normalise, you may be able to do it chunk by chunk (e.g. scaling by a fixed factor or applying a label encoding).
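Both kinds of normalisation mentioned above can be applied to each chunk independently, provided the constants are fixed before the first chunk arrives. The sketch below assumes a known maximum feature value (255.0) and a known set of labels; both, along with the normalize_chunk helper, are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

SCALE = 255.0  # assumption: maximum feature value is known in advance
KNOWN_LABELS = ["cat", "dog", "bird"]  # assumption: full label set is known

# Fit the encoder once on the full label set, not on any single chunk.
encoder = LabelEncoder().fit(KNOWN_LABELS)


def normalize_chunk(X_chunk, labels_chunk):
    """Hypothetical helper: stateless normalisation of one chunk."""
    return X_chunk / SCALE, encoder.transform(labels_chunk)


X_a, y_a = normalize_chunk(np.array([[0.0, 255.0], [51.0, 102.0]]), ["dog", "cat"])
X_b, y_b = normalize_chunk(np.array([[127.5, 25.5]]), ["bird"])
```

Because neither step depends on the contents of the chunk, the same input always maps to the same output regardless of which chunk it appears in.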

maxymoo