
In scikit-learn 0.24.0 or above, when you use either GridSearchCV or RandomizedSearchCV with n_jobs=-1 and any verbose level (1, 2, 3, or 100), no progress messages get printed. However, if you use scikit-learn 0.23.2 or lower, everything works as expected and joblib prints the progress messages.

Here is a sample code you can use to repeat my experiment in Google Colab or Jupyter Notebook:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[0.1, 1, 10]}
svc = svm.SVC()

clf = GridSearchCV(svc, parameters, scoring='accuracy', refit=True, n_jobs=-1, verbose=60)
clf.fit(iris.data, iris.target)
print('Best accuracy score: %.2f' %clf.best_score_)

Results using scikit-learn 0.23.2:

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0295s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done   2 out of  30 | elapsed:    0.0s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done   3 out of  30 | elapsed:    0.0s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done   4 out of  30 | elapsed:    0.0s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done   5 out of  30 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   6 out of  30 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   7 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   8 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   9 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  11 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  12 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  13 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  14 out of  30 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  15 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  16 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  17 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  18 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  19 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  21 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  22 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  24 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  25 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  26 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  28 out of  30 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    0.1s finished
Best accuracy score: 0.98

Results using scikit-learn 0.24.0 (tested up to v1.0.2):

Fitting 5 folds for each of 6 candidates, totaling 30 fits
Best accuracy score: 0.98

It appears to me that scikit-learn 0.24.0 and above are not passing the "verbose" value on to joblib, and therefore the progress is not printed when multiple processors are used in GridSearchCV or RandomizedSearchCV with the "loky" backend.
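
As a quick sanity check (a minimal sketch of my own, assuming only a standard joblib installation), joblib itself still prints its progress log in the same notebook when a verbose level is passed to Parallel directly, which is what makes me think the value is simply not being forwarded:

from math import sqrt
from joblib import Parallel, delayed

# Assumption: plain joblib, outside scikit-learn. If joblib prints its usual
# "[Parallel(n_jobs=-1)]: Done ..." lines here, the missing output above is
# about how GridSearchCV forwards verbose, not about joblib or the notebook.
Parallel(n_jobs=-1, verbose=10)(delayed(sqrt)(i) for i in range(30))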

Any idea how to solve this issue in Google Colab or Jupyter Notebook and get the progress log printed for sklearn 0.24.0 or above?

Ashtad
  • Still a problem, did you ever find a solution @Ashtad? :\ – jtlz2 Apr 05 '22 at 07:11
  • Yes. Unfortunately, I still haven't found a solution to this problem. @jtlz2 – Ashtad Apr 06 '22 at 13:44
  • I have found others have faced the same issue: https://stackoverflow.com/questions/67120754/gridsearchcv-not-showing-verbose-levels – Ashtad Apr 16 '22 at 22:20
  • Also, people are referring to the same problem here: https://github.com/scikit-learn/scikit-learn/issues/22849 – Ashtad Apr 16 '22 at 23:44
  • Ah, I was wondering why the progress only showed up when using n_jobs = 1. I thought it was stuck with any other value. I am also confused: what does "candidates" refer to? – fatbringer Nov 15 '22 at 02:30
  • "Candidates" here refers to all the possible combinations of hyperparameters (e.g., 2*3 where 2 is for two different kernels and 3 is for the three different C values). – Ashtad Nov 16 '22 at 03:19

1 Answer


Here's a roundabout way of getting GridSearchCV behavior and having the progress print along the way in Google Colab. It would need to be adapted for RandomizedSearchCV behavior.

This requires creating training, validation, and test sets. We will use the validation set for evaluating the candidate models along the way and hold out the test set for testing the final best model.

import gc
import sys  # needed for the file=sys.stdout argument passed to tqdm below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm

from sklearn.neighbors import KernelDensity
from scipy import stats
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, ParameterGrid

# This is based on the target and features from my dataset
y = relationships["tmrca"]
X = relationships.drop(columns = ["sample1", "sample2", "total_span_cM", "max_span_cM", "relationship", "tmrca"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print(f"X_train size: {len(X_train):,} \nX_validation size: {len(X_validation):,} \nX_test size: {len(X_test):,}")

Here, we define the method.

def random_forest_tvt(para_grid, seed):
    # grid search for the hyperparameters like n_estimators, max_leaf_nodes, etc.
    # fit model on training set, tune paras on validation set, save best paras
    error_min = 1
    count = 0
    clf = RandomForestClassifier(n_jobs=-1, random_state=seed)
    num_fits = len(ParameterGrid(para_grid))
    with tqdm(total=num_fits, desc=f"Trying the models for the best fit...", file=sys.stdout) as fit_pbar:
        
        for g in ParameterGrid(para_grid):
            count += 1
            print(f"\n{g}")
            clf.set_params(**g)
            clf.fit(X_train, y_train)

            y_predict_validation = clf.predict(X_validation)
            accuracy_measure = accuracy_score(y_validation, y_predict_validation)
            error_validation = 1 - accuracy_measure
            print(f"The accuracy is {accuracy_measure * 100:.2f}%.\n")

            if(error_validation < error_min):
                error_min = error_validation
                best_para = g
            
            fit_pbar.update()
    
    # fitting the model on the best parameters for method output
    clf.set_params(**best_para)
    clf.fit(X_train, y_train)

    y_predict_train =  clf.predict(X_train)
    score_train = accuracy_score(y_train, y_predict_train)

    y_predict_validation =  clf.predict(X_validation)
    score_validation = accuracy_score(y_validation, y_predict_validation)

    return(best_para, score_train, score_validation)

And then we define the parameter grid and call the method.

seed = 0

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1000, stop = 5000, num = 3)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 3)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True]
# Parameter Grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(f"The parameter grid\n{random_grid}\n")

best_parameters, score_train, score_validation = random_forest_tvt(random_grid, seed)
print(f"\n === Random Forest ===\n Best parameters are: {best_parameters} \n training score: {score_train * 100:.2f}%, validation error: {score_validation * 100:.2f}.")

And then here are the first 5 fit results printed as output in Google Colab while the method is still running.

The parameter grid
{'n_estimators': [1000, 3000, 5000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 60, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True]}

Trying the models for the best fit...:   0%|          | 0/216 [00:00<?, ?it/s]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 1000}
The accuracy is 85.13%.

Trying the models for the best fit...:   0%|          | 1/216 [00:16<58:27, 16.31s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 3000}
The accuracy is 85.13%.

Trying the models for the best fit...:   1%|          | 2/216 [01:05<2:06:44, 35.53s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 5000}
The accuracy is 85.10%.

Trying the models for the best fit...:   1%|▏         | 3/216 [02:40<3:42:34, 62.70s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 1000}
The accuracy is 85.15%.

Trying the models for the best fit...:   2%|▏         | 4/216 [02:56<2:36:00, 44.15s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 3000}
The accuracy is 85.14%.

Trying the models for the best fit...:   2%|▏         | 5/216 [03:43<2:39:13, 45.28s/it]

And then you can use best_parameters to do further fine-tuning or to call the predict method on the test set.

best_grid = RandomForestClassifier(n_jobs=-1, random_state=seed)
best_grid.set_params(**best_parameters)
best_grid.fit(X_train, y_train)
y_predict_test =  best_grid.predict(X_test)
score_test = accuracy_score(y_test, y_predict_test)
print(f"{score_test:.2f}%")

You would need to make further adaptations to get the k-fold behavior; a rough sketch is given after this paragraph. As it stands, each candidate model is fit on the training set and scored once on the validation set, and the model with the best parameters is then fit one more time to produce the output. Finally, you can use the output parameters to do further fine-tuning (not shown here) or call the predict method on the test set.
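
As a rough sketch of that k-fold adaptation (my own illustration, not tested against the dataset above; the name random_forest_kfold is hypothetical), the single validation-set score inside the loop could be replaced with the mean of a cross-validation score computed on the training data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

def random_forest_kfold(para_grid, seed, n_splits=5):
    # Hypothetical variant of random_forest_tvt above: score each candidate
    # with k-fold cross-validation on the training set (X_train, y_train from
    # above) instead of a single train/validation split.
    best_para, best_score = None, -1.0
    for g in ParameterGrid(para_grid):
        clf = RandomForestClassifier(n_jobs=-1, random_state=seed, **g)
        cv_scores = cross_val_score(clf, X_train, y_train, cv=n_splits, scoring='accuracy')
        if cv_scores.mean() > best_score:
            best_score, best_para = cv_scores.mean(), g
    return best_para, best_score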

LaKisha David