
I am working with a large data set (1.7 million samples) and am trying to train a classifier. I've had good results with scikit-learn's RandomForestClassifier, and now I want to perform permutation feature importance analysis on my classifier with scikit-learn's permutation_importance function. To do so, I've put my preprocessing into a ColumnTransformer and placed it into a Pipeline along with my classifier.
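For context, the end goal is a call like the following on the fitted pipeline (a minimal sketch; n_repeats and n_jobs values are illustrative, and clf, X_test, and y_test are defined further down):

from sklearn.inspection import permutation_importance

# `clf` is the fitted Pipeline; importances come back in the order of
# the raw input columns, since preprocessing happens inside the pipeline.
result = permutation_importance(
    clf, X_test, y_test, n_repeats=5, n_jobs=10
)
print(result.importances_mean)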

I have been doing my preprocessing and training my classifier as separate steps, and it takes about 6 seconds to train an individual tree. However, when set up in a pipeline, the same classifier takes about 5 minutes to train a tree. Since I'm using a very large number of trees (and please, I'm not looking for feedback on that), this is not practical for me to work with.

I've dug around on the web extensively, but apparently can't find the right combination of search terms. Here is the relevant code:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# The *_cols variables are lists of column names, defined elsewhere.
preprocessing = ColumnTransformer(
    [
        ('binary', OneHotEncoder(drop='if_binary'),
         binary_category_cols),
        ('complete', OneHotEncoder(),
         complete_category_cols),
        ('inc_cat', OneHotEncoder(drop=[-9]*len(incomplete_category_cols)),
         incomplete_category_cols),
        ('inc_ord', OneHotEncoder(drop=[-9]*len(incomplete_ordinal_cols)),
         incomplete_ordinal_cols),
        ('cmp_ord', OrdinalEncoder(),
         complete_ordinal_cols)
    ], verbose=True, n_jobs=10
)

rf_clf = RandomForestClassifier(
    n_jobs=10, verbose=3,
    n_estimators=1000, criterion='entropy',
    max_depth=None, max_features='sqrt',
    max_samples=None, class_weight='balanced_subsample'
)

clf = Pipeline(
    [
        ('preprocess', preprocessing),
        ('classifier', rf_clf)
    ]
)

X_train, X_test, y_train, y_test = \
    train_test_split(features, targets, test_size=0.1)

clf.fit(X_train, y_train)

'features' is a pandas DataFrame, and 'targets' is a pandas Series. If I bypass the pipeline, fit and transform X_train with the ColumnTransformer, and then fit the RandomForestClassifier on the result, trees are built at about 50 times the speed, as in the sketch below.
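For reference, that separate-steps configuration looks roughly like this, using the same preprocessing and rf_clf objects defined above:

# Fast path: preprocess once, then fit the forest on the transformed matrix.
Xt_train = preprocessing.fit_transform(X_train)
rf_clf.fit(Xt_train, y_train)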

I've tried feeding both configurations much smaller chunks of my data, and the difference is far smaller, so maybe there are limitations to Pipeline with respect to large data sets? If I set max_samples to 100,000, training time is manageable, but classifier performance drops substantially.
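(That workaround amounts to capping the bootstrap sample drawn for each tree:

rf_clf.set_params(max_samples=100_000)  # each tree trains on at most 100k samples

but the accuracy cost makes it unattractive here.)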

Any help would be much appreciated. I've been banging my head on my desk for about four hours on a stumbling block I had in no way anticipated.

Denis

1 Answer


Given that you are setting n_jobs to 10, my guess is that you are using multiprocessing. In that case, you might want to consider running your script within an if __name__ == "__main__": block.
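A minimal sketch of that structure (the wrapper function name is arbitrary):

def main():
    # build the preprocessing, classifier, pipeline, and splits here, then:
    clf.fit(X_train, y_train)

if __name__ == "__main__":
    # keeps joblib's spawned worker processes from re-running
    # the module's top-level code when they import it
    main()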

See https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py for an example that performs a hyperparameter search, uses the same Pipeline object you are using, and explicitly sets n_jobs.

Also see this related question: Python scikit learn n_jobs

At least in my case, it has allowed me to decrease runtime when doing hyperparameter search for RandomForestClassifier.

  • It's been quite a while since I did that project, and I was doing multiprocessing. However, I was having no problems with multiprocessing except when I put a process into a pipeline as described there. I may or may not have had it running in an "if __name__ == '__main__': method()" configuration at the time, but I've run multiprocessed scikit-learn models both ways without issues - I don't see how that would be related. Regardless, I've moved on from scikit-learn since. It's a great starting point for machine learning, but its flexibility and customization are limited. – Denis Feb 20 '22 at 15:57