sklearn: fitting RandomForestClassifier or normalizing data with chunks of data

Question

I have a large file of training data. I am worried about what happens when I use this code:

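from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
for chunk in reader:  # reader yields successive chunks of the training data
    clf.fit(chunk, target)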

Will clf produce a model for all chunks or only for the current one? For incremental learning, should I use only classifiers with a partial_fit() method? And how should I normalize the training data in that setting (build a normalizer for the whole data set, not only for the current chunk)?

In scikit-learn, a new call to `fit()` discards whatever the previous `fit()` learned, so the model reflects only the last `fit()`. You need `partial_fit()`. http://scikit-learn.org/stable/modules/scaling_strategies.html#incremental-learning – Vivek Kumar, Jun 23 '17 at 05:56
@VivekKumar Does this apply to the TfidfVectorizer class too? – cs95, Jun 23 '17 at 05:59
@Coldspeed No. If the data is that large, it's [advised by the sklearn authors](https://stackoverflow.com/a/17536682/3374996) to use HashingVectorizer instead. – Vivek Kumar, Jun 23 '17 at 06:03
It is not such an easy question. There is a warm_start parameter in the constructor, but when you fit a new chunk of data, "UserWarning: Warm-start fitting without increasing n_estimators does not fit new trees" is raised. It is also possible to fit separate forests and then append their estimators (https://stackoverflow.com/questions/28489667/combining-random-forest-models-in-scikit-learn), but I'm not sure that will work well. Has anybody tried it? – LinearLeopard, Jun 23 '17 at 18:32
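
For reference, a minimal sketch of the warm_start route from the last comment: increasing n_estimators before each fit() avoids the warning, and each call then adds new trees trained on the current chunk only. The synthetic data and the chunk size of 200 are assumptions for illustration, and the sketch assumes every chunk contains all classes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the big file (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
for i, start in enumerate(range(0, len(X), 200)):
    if i > 0:
        # Ask for 10 more trees; existing trees are kept untouched.
        clf.n_estimators += 10
    # With warm_start=True only the newly requested trees are fitted,
    # and they see the current chunk only.
    clf.fit(X[start:start + 200], y[start:start + 200])

print(len(clf.estimators_))
```

Note that earlier trees are never updated, so this is not true incremental learning: each tree has only ever seen one chunk.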

score 1 · Answer 1 · answered Jun 23 '17 at 13:16

Yes, for incremental learning you can only use classifiers which implement partial_fit.
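
As a rough sketch of what that looks like, here is SGDClassifier (one of the estimators that implements partial_fit) trained chunk by chunk. The synthetic data and the chunk size of 250 are assumptions standing in for a real file reader.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a large file; in practice chunks come from a reader.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
all_classes = np.unique(y)  # the full label set must be known up front

clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 250):
    X_chunk, y_chunk = X[start:start + 250], y[start:start + 250]
    # Each call updates the existing model instead of starting over.
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)

print(clf.score(X, y))
```

The classes argument is required on the first partial_fit call, because a single chunk may not contain every label.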

StandardScaler has a partial_fit method, so it can be applied online. I'm not sure that's the right way to do it, though, as the transformation will change over time as more chunks are seen. If you don't expect the data distribution to change much, you can also fit any scaler on a subset of the data and use it later.
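
One way around the changing transformation, assuming you can afford two passes over the data, is to accumulate the statistics first and transform afterwards. A minimal sketch on synthetic data (the array and the five-way split are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
chunks = np.array_split(X, 5)  # stand-in for chunks read from disk

# Pass 1: accumulate mean and variance over all chunks.
scaler = StandardScaler()
for chunk in chunks:
    scaler.partial_fit(chunk)

# Pass 2: transform every chunk with the same, final statistics.
scaled_chunks = [scaler.transform(chunk) for chunk in chunks]

# Sanity check against a scaler fitted on the whole array at once.
full_scaler = StandardScaler().fit(X)
print(np.allclose(scaler.mean_, full_scaler.mean_))
```

The incrementally fitted scaler ends up with the same statistics as one fitted on everything in memory, up to floating-point error.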

Also note that RandomForestClassifier (like all tree-based classifiers) is invariant to feature scaling, since its splits depend only on the ordering of each feature's values, so it is not clear that standardization has any effect on it.
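
A quick way to check this yourself is to fit two forests with the same random_state, one on raw and one on standardized features, and compare predictions. The toy data below is an assumption for illustration, not a proof for every data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Integer-valued toy features with a simple linear decision rule.
X = rng.integers(0, 100, size=(300, 4)).astype(float)
y = (X[:, 0] + 2 * X[:, 1] > 150).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same seed, so both forests draw the same bootstraps and feature subsets.
rf_raw = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_scaled, y)

# Fraction of samples on which the two forests agree.
agreement = np.mean(rf_raw.predict(X) == rf_scaled.predict(X_scaled))
print(agreement)
```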

score 1 · Answer 2 · edited Nov 16 '21 at 02:07

RandomForestClassifier does not implement partial_fit(), the method that supports incremental learning on chunks of data.

However, you can combine separately trained RandomForestClassifier models, as mentioned here, by merging their estimators_ and updating n_estimators.
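
A minimal sketch of that idea: train one forest per chunk, then append the trees to the first forest. This relies on estimators_ being a plain list, and it assumes every chunk contains the same set of classes; the synthetic data and chunk size are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
chunks = [(X[i:i + 200], y[i:i + 200]) for i in range(0, 600, 200)]

combined = None
for i, (X_chunk, y_chunk) in enumerate(chunks):
    rf = RandomForestClassifier(n_estimators=10, random_state=i)
    rf.fit(X_chunk, y_chunk)
    if combined is None:
        combined = rf
    else:
        # Merging only makes sense if both forests know the same classes.
        assert np.array_equal(combined.classes_, rf.classes_)
        combined.estimators_ += rf.estimators_
        combined.n_estimators = len(combined.estimators_)

preds = combined.predict(X)
print(combined.n_estimators)
```

Each tree has still only seen one chunk, so the result may differ from a forest trained on the full data.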

edited Nov 16 '21 at 02:07

tdy

answered Nov 15 '21 at 15:09

Akanksha Jain

maxymoo · Answer 3 · 2017-06-23T06:06:11.593

score 0

Yes, this will only work for classifiers with partial_fit. Depending on how you normalise, you may be able to do the normalisation chunk by chunk (e.g. scaling by a fixed factor or doing a label encoding).
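
To make those two cases concrete, here is a small sketch. The names SCALE, CATEGORY_TO_CODE and preprocess_chunk are hypothetical, and it assumes the scale factor and the full set of categories are known before reading any chunk.

```python
import numpy as np

# Assumptions for illustration: a known maximum value and a known category set.
SCALE = 255.0
CATEGORY_TO_CODE = {"red": 0, "green": 1, "blue": 2}

def preprocess_chunk(values, colors):
    """Normalise one chunk without looking at any other chunk."""
    # Fixed-factor scaling: the divisor does not depend on the data.
    scaled = np.asarray(values, dtype=float) / SCALE
    # Label encoding with a mapping fixed in advance.
    codes = np.array([CATEGORY_TO_CODE[c] for c in colors])
    return scaled, codes

scaled, codes = preprocess_chunk([0, 127.5, 255], ["red", "blue", "green"])
print(scaled, codes)
```

Because neither transform uses statistics of the data, every chunk is treated identically no matter when it is read.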

edited Jun 23 '17 at 06:06

answered Jun 23 '17 at 05:56

maxymoo

Could you please write more about "scaling by a fixed factor or doing a label encoding"? A short example or a URL to read? – LinearLeopard Jun 23 '17 at 06:01
I am training the model on non-text data – LinearLeopard Jun 23 '17 at 06:04
