
I just started using MLflow and I am happy with what it can do. However, I cannot find a way to log the different runs of a GridSearchCV from scikit-learn.

For example, I can do this manually:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

params = ['l1', 'l2']
for param in params:
    with mlflow.start_run(experiment_id=1):
        clf = LogisticRegression(penalty=param).fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)

        mlflow.sklearn.log_model(clf, "model")

But when I want to use GridSearchCV like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)

best_clf = clf.fit(X_train, y_train)

I cannot think of any way to log all the individual models that the grid search tests. Is there any way to do it, or do I have to keep using the manual process?

Tasos

3 Answers

I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest; it will arrive at good parameters faster than a grid search, and you can limit the number of iterations regardless of the size of the space, so it's definitely better for large search spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search, which still lets you choose how many runs to perform.

You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit-learn's cross-validation; just put it inside the objective function (you can even keep track of the variance of the cross-validation using loss_variance).
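
To make the loss_variance point concrete, here is a minimal sketch of my own (not from the linked example) of an objective that reports the cross-validation variance back to hyperopt; the dict keys follow hyperopt's return-value convention, and LogisticRegression, X_train and y_train are placeholder assumptions:

from hyperopt import STATUS_OK
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def objective_with_variance(params):
    # X_train / y_train are assumed to exist in the surrounding scope
    scores = cross_val_score(LogisticRegression(**params), X_train, y_train, cv=5)
    return {
        'loss': -scores.mean(),         # hyperopt minimizes, so negate the score
        'loss_variance': scores.var(),  # lets hyperopt account for CV noise
        'status': STATUS_OK,
    }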

Now, to actually answer the question: I believe you can log the model, parameters, metrics, or whatever else you need inside the objective function that you pass to hyperopt.fmin. MLflow will store each run as a child of the main run, and each run can have its own artifacts.

So you want something like this:

import mlflow
import mlflow.sklearn
from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.model_selection import cross_validate

def objective(params):
    metrics = ...  # list of scorer names, e.g. ['accuracy', ...]
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}
    # log all the stuff here
    mlflow.log_metric('...', scores[...].mean())
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), 'model')
    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)
with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
l_l_l_l_l_l_l_l
  • thanks for sharing this answer, it has given me some great ideas/options to explore to solve the problem, I do wish I could neatly hook MLflow into sklearn's GridSearchCV :( – janh Dec 22 '21 at 02:18
  • Hyperopt really doesn't seem to do it for me; it's super unreliable. I'm running it and there's a high chance it will just hang and never complete the optimization at all. Other times it finishes in 50 seconds with the same exact code in the same exact environment. Weird. – lte__ Jan 31 '23 at 13:54

I agree with the other answer that using hyperopt would be ideal for logging experiments with MLflow, especially in a Spark environment. One way to log individual model fits within GridSearchCV is to extend the sklearn estimator's fit method and pass a callback function to GridSearchCV's fit.

Any parameter passed to GridSearchCV's fit is cascaded down to the fit method of the estimators within GridSearchCV. This allows us to pass a logger function that stores parameters, metrics, models, etc. with MLflow.

Here is an example with RandomForestClassifier as the estimator; the approach should work with any other estimator as well:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


class CustomRandomForestClassifier(RandomForestClassifier):
    '''
    A custom random forest classifier.
    The RandomForestClassifier class is extended by adding a callback function within its fit method.
    '''
    def fit(self, X, y, **kwargs):
        super().fit(X, y)
        # if a "callback" key is passed, call the "callback" function by passing the fitted estimator
        if 'callback' in kwargs: 
            kwargs['callback'](self)
        return self


class Logger:
    '''
    Logger class stores the test dataset,
    and logs sklearn random forest estimator in rf_logger method.
    '''
    def __init__(self, test_X, test_y):
        self.test_X = test_X
        self.test_y = test_y

    def rf_logger(self, model):
        # log the random forest model in nested mlflow runs
        with mlflow.start_run(nested=True):
            mlflow.log_param("n_estimators", model.n_estimators)
            mlflow.log_param("max_leaf_nodes", model.max_leaf_nodes)
            mlflow.log_metric("score", model.score(self.test_X, self.test_y))
            mlflow.sklearn.log_model(model, 'rf_model')
        return None


crf = CustomRandomForestClassifier(random_state=9)
param_grid = {
    'n_estimators': [10,20],
    'max_leaf_nodes': [25,50]
}


# Use custom random forest classifier while defining the estimator for grid search 
grid = GridSearchCV(crf, param_grid, cv=2, refit=True)


# Instantiate Logger with test dataset
logger = Logger(test_X, test_y)


# start outer mlflow run and perform grid search with cross-validation
with mlflow.start_run(run_name = "grid_search"):
    # while calling GridSearchCV object's fit method pass logger.rf_logger
    # logger.rf_logger takes care of logging each fitted model during gridsearch
    grid.fit(train_X, train_y, callback = logger.rf_logger)

    # log the best estimator found by grid search in the outer mlflow run
    mlflow.log_param("n_estimators", grid.best_params_['n_estimators'])
    mlflow.log_param("max_leaf_nodes", grid.best_params_['max_leaf_nodes'])
    mlflow.log_metric("score", grid.score(test_X, test_y))
    mlflow.sklearn.log_model(grid.best_estimator_, 'best_rf_model')
Teilnehmer

I've just been trying to do the same thing and stumbled upon this thread.

My solution, if you just need to log the results of each model in the grid search rather than monitor them as they come in, is to add the following code after the last line of your example code.

This code takes the results of the cross-validation (i.e., the parameters and performance of each of the tested models) and loops through them, logging the results with MLflow. This is just a demonstration, but you could also set it up to track each CV fold, log the time taken, etc.

import mlflow
import pandas as pd

# Extract the results of the grid search
results = pd.DataFrame(best_clf.cv_results_)
# Keep only the columns we care about
results = results[['params', 'mean_test_score', 'std_test_score']]
# Convert to an array for iterating
results = results.values

# Loop through each tested parameter combination and save it
# as a separate run within the experiment
for some_run in results:
    with mlflow.start_run(experiment_id=1):
        # Log model configuration/params
        mlflow.log_params(some_run[0])
        # Log metrics
        metrics = {
            'accuracy': some_run[1],
            'std': some_run[2]
        }
        mlflow.log_metrics(metrics)
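
A small variation (my own sketch, not part of the answer above): if you would rather group these as child runs under a single parent run, as in the other answers, you can wrap the loop in a parent run and mark each inner run as nested. The run name "grid_search" is just an illustrative choice:

# Hypothetical variant: log each cv_results_ entry as a nested child run
# under one parent run instead of as separate top-level runs.
with mlflow.start_run(run_name="grid_search", experiment_id=1):
    for some_run in results:
        with mlflow.start_run(nested=True):
            mlflow.log_params(some_run[0])
            mlflow.log_metrics({'accuracy': some_run[1], 'std': some_run[2]})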
Téo