I am working with a large data set (1.7 million samples) and am trying to train a classifier. I've had good results with scikit-learn's RandomForestClassifier, and now I want to perform permutation feature analysis on my classifier with scikit-learn's permutation_importance method. To do so I've put my preprocessing into a ColumnTransformer, and placed it into a Pipeline along with my classifier.
I have been doing my preprocessing and training my classifier as separate steps, and it takes about 6 seconds to train an individual tree. However, when the same classifier is set up in a pipeline, it takes about 5 minutes per tree. I'm using a very large number of trees (and please, I'm not looking for feedback on that), so this is not really practical for me to work with.
I've dug around on the web extensively, but apparently can't find the right combination of search terms. Here is the relevant code:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

preprocessing = ColumnTransformer(
    [
        ('binary', OneHotEncoder(drop='if_binary'),
         binary_category_cols),
        ('complete', OneHotEncoder(),
         complete_category_cols),
        ('inc_cat', OneHotEncoder(drop=[-9] * len(incomplete_category_cols)),
         incomplete_category_cols),
        ('inc_ord', OneHotEncoder(drop=[-9] * len(incomplete_ordinal_cols)),
         incomplete_ordinal_cols),
        ('cmp_ord', OrdinalEncoder(),
         complete_ordinal_cols)
    ], verbose=True, n_jobs=10
)

rf_clf = RandomForestClassifier(
    n_jobs=10, verbose=3,
    n_estimators=1000, criterion='entropy',
    max_depth=None, max_features='sqrt',
    max_samples=None, class_weight='balanced_subsample'
)

clf = Pipeline(
    [
        ('preprocess', preprocessing),
        ('classifier', rf_clf)
    ]
)

X_train, X_test, y_train, y_test = \
    train_test_split(features, targets, test_size=0.1)

clf.fit(X_train, y_train)
'features' is a pandas DataFrame and 'targets' is a pandas Series. If I skip the pipeline, fit and transform X_train with the ColumnTransformer directly, and then fit the RandomForestClassifier on the result, trees are built roughly 50 times faster.
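For clarity, here is a minimal, self-contained sketch of the two setups being compared, on tiny synthetic data (the column names and sizes are placeholders, not my real columns). One difference it also surfaces: whether the ColumnTransformer hands the forest a sparse or a dense matrix, which it decides at fit time via its sparse_threshold parameter (default 0.3).

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Tiny synthetic stand-in for the real data; columns are hypothetical.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    'bin': rng.integers(0, 2, 200),
    'cat': rng.integers(0, 4, 200),
    'ord': rng.integers(0, 5, 200),
})
targets = pd.Series(rng.integers(0, 2, 200))

preprocessing = ColumnTransformer([
    ('binary', OneHotEncoder(drop='if_binary'), ['bin']),
    ('complete', OneHotEncoder(), ['cat']),
    ('cmp_ord', OrdinalEncoder(), ['ord']),
])
rf_clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Setup 1 (fast for me): preprocess once, then fit on the transformed matrix.
Xt = preprocessing.fit_transform(features)
rf_clf.fit(Xt, targets)

# Worth checking: is the transformed matrix sparse or dense?
print('sparse output:', sp.issparse(Xt))

# Setup 2 (slow for me): same two objects chained in a Pipeline.
pipe = Pipeline([('preprocess', preprocessing), ('classifier', rf_clf)])
pipe.fit(features, targets)
```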
I've tried feeding both configurations much smaller chunks of my data, and the difference is far smaller, so maybe Pipeline has limitations with respect to large data sets? If I set max_samples to 100,000, training speed becomes manageable, but the classifier's predictive performance drops substantially.
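For what it's worth, this is roughly the harness I've been using to time both configurations on a smaller chunk (synthetic data and sizes here are made up for reproducibility):

```python
import time

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Placeholder data: 5 categorical columns, 5,000 rows.
rng = np.random.default_rng(0)
n = 5000
features = pd.DataFrame({f'cat{i}': rng.integers(0, 8, n) for i in range(5)})
targets = pd.Series(rng.integers(0, 2, n))

pre = ColumnTransformer([('ohe', OneHotEncoder(), list(features.columns))])
rf = RandomForestClassifier(n_estimators=20, random_state=0)

# Time the separate-steps configuration.
t0 = time.perf_counter()
rf.fit(pre.fit_transform(features), targets)
t_separate = time.perf_counter() - t0

# Time the pipelined configuration on the same data.
t0 = time.perf_counter()
Pipeline([('pre', pre), ('rf', rf)]).fit(features, targets)
t_pipeline = time.perf_counter() - t0

print(f'separate: {t_separate:.3f}s  pipeline: {t_pipeline:.3f}s')
```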
Any help would be much appreciated. I've been banging my head on my desk for about four hours over a stumbling block I had in no way anticipated.