Is there a way to stop a scikit-learn `fit()` call if it's taking too long? I understand the need to use a smaller dataset, tune hyperparameters, etc., but that's beside the point here: I'm looking for the "quickest" regressors that run on my dataset using their default values.
Windows 10, Python 3.10.5, scikit-learn==1.1.1
Pseudo-code of what I'm looking for looks something like this:
start a timer
start sklearnRegressor.fit()
if current time - start time > X seconds, force-stop the previous fit call (i.e., simulate CTRL+C)
I have created a minimal example below for testing. The flow looks like this:
- Create a mock regression dataset
- Perform a train/test split
- Gather all of scikit-learn's regressors (verified by having an `__init__`) and store them in a list
- Iterate through the regressors, performing `RFECV` for each, and time how long each regressor takes
Some regressors take (way) longer than others, understandably so, but I want to give them all a chance (assuming no prior knowledge of how long each might take) without letting my program wait too long for any one of them to finish. I've implemented multiprocessing in my real use case, but the same problem remains.
Here's the code:
import time
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.utils import all_estimators

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# ------------------------------------------------------------------------------------------------ #
#                                     MAKE REGRESSION DATASET                                      #
# ------------------------------------------------------------------------------------------------ #
X, y = make_regression(n_samples=3000,
                       n_features=250,
                       n_informative=50,
                       n_targets=1,
                       shuffle=True,
                       noise=0.1,
                       coef=False)

# ------------------------------------------------------------------------------------------------ #
#                                        TRAIN TEST SPLIT                                          #
# ------------------------------------------------------------------------------------------------ #
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    shuffle=False,
                                                    test_size=0.25,
                                                    random_state=0)

# ------------------------------------------------------------------------------------------------ #
#                                    GET ALL SKLEARN REGRESSORS                                    #
# ------------------------------------------------------------------------------------------------ #
all_sklearn_regressors_unfiltered = all_estimators(type_filter='regressor')
all_sklearn_regressors_filtered = []
for name, RegressorClass in all_sklearn_regressors_unfiltered:
    try:
        reg = RegressorClass()
        print("Adding regressor:", name)
        all_sklearn_regressors_filtered.append(reg)
    except Exception as e:
        print("ERROR:", name, ":", e, ", not adding it.")

# ------------------------------------------------------------------------------------------------ #
#                                PERFORM RFECV USING EACH REGRESSOR                                #
# ------------------------------------------------------------------------------------------------ #
for current_sklearn_regressor in all_sklearn_regressors_filtered:
    # Start a timer(?)
    start_time = time.time()

    # Create a RFECV object
    selector = RFECV(current_sklearn_regressor,
                     step=10,
                     cv=5,
                     verbose=0,
                     n_jobs=1)

    # Fit the current regressor
    results = selector.fit(X_train, y_train)

    # Stop the timer
    end_time = time.time()

    # Report how long the current regressor took to complete its run
    print(current_sklearn_regressor, end_time - start_time, 'seconds')

    # The rest of feature selection could go here, but it's irrelevant
    # to this problem of stopping a run if it takes longer than X
    # seconds...
Of course, this logic doesn't work: the timer only reports after each regressor's run is already done. I'd like to keep an eye on running time, and if after (say) 120 seconds it's still "fitting", just stop it and move on to the next one.
Short of sitting here and letting this run in its entirety for every regressor, I'd like a way to implement some kind of timer that stops the current fit if it's taking too long. I'm thinking maybe throw the fitting portion into a function and then create some kind of timer decorator? Maybe something like:
# ------------------------------------------------------------------------------------------------ #
#                                PERFORM RFECV USING EACH REGRESSOR                                #
# ------------------------------------------------------------------------------------------------ #
@fit_time_decorator
def iterate_through_regressors():
    for current_sklearn_regressor in all_sklearn_regressors_filtered:
        # Start a timer(?)
        start_time = time.time()

        # Create a RFECV object
        selector = RFECV(current_sklearn_regressor,
                         step=10,
                         cv=5,
                         scoring='neg_mean_absolute_error',  # 'absolute_error' is not a valid scorer name
                         verbose=0,
                         n_jobs=1)

        # Fit the current regressor
        results = selector.fit(X_train, y_train)

        # Stop the timer
        end_time = time.time()

        # Report how long the current regressor took to complete its run
        print(current_sklearn_regressor, end_time - start_time, 'seconds')

        # The rest of feature selection could go here, but it's irrelevant
        # to this problem of stopping a run if it takes longer than X
        # seconds...
    return
...where the decorator would monitor how long the `fit` call has been running, then automatically stop it (the auto CTRL+C) and continue to the next regressor. Is that possible?
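One way this could work, sketched here under assumptions: the fit has to run in a child process to be killable, and `multiprocessing` must pickle whatever it runs (especially with Windows' spawn start method), which makes a decorator wrapping the whole loop fragile. A plain per-call helper avoids that; `run_with_timeout` and `slow` are illustrative names, not library APIs:

```python
import multiprocessing
import time

def run_with_timeout(func, args=(), kwds=None, max_seconds=120):
    # Run `func` in a one-worker process pool; if it doesn't finish within
    # `max_seconds`, kill the worker (the "simulated CTRL+C") and return None.
    with multiprocessing.Pool(processes=1) as pool:
        result = pool.apply_async(func, args, kwds or {})
        try:
            return result.get(timeout=max_seconds)
        except multiprocessing.TimeoutError:
            pool.terminate()   # force-stop the stuck worker
            return None

def slow(seconds):
    # Stand-in for selector.fit(X_train, y_train)
    time.sleep(seconds)
    return seconds

if __name__ == "__main__":
    print(run_with_timeout(slow, (0.5,), max_seconds=5))   # finishes -> 0.5
    print(run_with_timeout(slow, (10,), max_seconds=1))    # killed -> None
```

In the loop above, something like `results = run_with_timeout(selector.fit, (X_train, y_train), max_seconds=120)` would then yield either the fitted selector or `None` on timeout; this assumes the estimator, its arguments, and the fitted result are all picklable, since they have to cross the process boundary.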
(Not really looking to explain WHY I'm doing this, it's just a demo; I'm just looking for a solution to this exact problem as described. Let me know if further clarification is needed and I'll respond within a day.)