Is there a way to stop a scikit-learn `fit()` call if it's taking too long? I understand the need to use a smaller dataset, tune hyperparameters, etc., but that's beside the point here: I'm looking for the "quickest" regressors that run on my dataset using their default values.
Windows 10, Python 3.10.5, scikit-learn==1.1.1
Pseudo-code of what I'm looking for looks something like this:
start a timer
start sklearnRegressor.fit()
if current time - start time > X seconds, force-stop the previous fit call (i.e., simulate CTRL+C)
I have created a minimal example below for testing. The flow looks like this:
- Create a mock regression dataset
- Perform a train/test split
- Gather all of scikit-learn's regressors (verified by having an `__init__`) and store them in a list
- Iterate through the regressors, performing `RFECV` for each, and time how long each regressor takes
Some regressors take (way) longer than others, understandably so, but I want to give them all a chance (assuming no prior knowledge of how long each might take) without letting my program wait too long for any one of them to finish. I've implemented multiprocessing in my real use case, but the same problem remains.
Here's the code:
import time
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.utils import all_estimators

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# ------------------------------------------------------------------------------------------------ #
#                                     MAKE REGRESSION DATASET                                      #
# ------------------------------------------------------------------------------------------------ #
X, y = make_regression(n_samples=3000,
                       n_features=250,
                       n_informative=50,
                       n_targets=1,
                       shuffle=True,
                       noise=0.1,
                       coef=False)

# ------------------------------------------------------------------------------------------------ #
#                                        TRAIN TEST SPLIT                                          #
# ------------------------------------------------------------------------------------------------ #
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    shuffle=False,
                                                    test_size=0.25,
                                                    random_state=0)

# ------------------------------------------------------------------------------------------------ #
#                                    GET ALL SKLEARN REGRESSORS                                    #
# ------------------------------------------------------------------------------------------------ #
all_sklearn_regressors_unfiltered = all_estimators(type_filter='regressor')
all_sklearn_regressors_filtered = []
for name, RegressorClass in all_sklearn_regressors_unfiltered:
    try:
        reg = RegressorClass()
        print("Adding regressor:", name)
        all_sklearn_regressors_filtered.append(reg)
    except Exception as e:
        print("ERROR:", name, ":", e, ", not adding it.")

# ------------------------------------------------------------------------------------------------ #
#                                PERFORM RFECV USING EACH REGRESSOR                                #
# ------------------------------------------------------------------------------------------------ #
for current_sklearn_regressor in all_sklearn_regressors_filtered:
    # Start a timer(?)
    start_time = time.time()

    # Create a RFECV object
    selector = RFECV(current_sklearn_regressor,
                     step=10,
                     cv=5,
                     verbose=0,
                     n_jobs=1)

    # Fit the current regressor
    results = selector.fit(X_train, y_train)

    # Stop the timer
    end_time = time.time()

    # Report how long the current regressor took to complete its run
    print(current_sklearn_regressor, end_time - start_time, 'seconds')

    # The rest of feature selection could go here, but it's irrelevant
    # to this problem of stopping a run if it takes longer than X
    # seconds...
Of course, this logic doesn't work: the timer only reports after each regressor's run is already done. I'd like to keep an eye on running time, and if after (say) 120 seconds it's still "fitting", just stop it and move on to the next one.
Short of sitting here and letting this run in its entirety for every regressor, I'd like a way to implement some kind of timer that stops the current fit if it's taking too long. I'm thinking maybe throw the fitting portion into a function and then create some kind of timer decorator? Maybe something like:
# ------------------------------------------------------------------------------------------------ #
#                                PERFORM RFECV USING EACH REGRESSOR                                #
# ------------------------------------------------------------------------------------------------ #
@fit_time_decorator
def iterate_through_regressors():
    for current_sklearn_regressor in all_sklearn_regressors_filtered:
        # Start a timer(?)
        start_time = time.time()

        # Create a RFECV object
        selector = RFECV(current_sklearn_regressor,
                         step=10,
                         cv=5,
                         scoring='neg_mean_absolute_error',  # 'absolute_error' is not a valid scorer name
                         verbose=0,
                         n_jobs=1)

        # Fit the current regressor
        results = selector.fit(X_train, y_train)

        # Stop the timer
        end_time = time.time()

        # Report how long the current regressor took to complete its run
        print(current_sklearn_regressor, end_time - start_time, 'seconds')

        # The rest of feature selection could go here, but it's irrelevant
        # to this problem of stopping a run if it takes longer than X
        # seconds...
    return
...where the decorator would monitor how long the `fit` call has been running, then automatically stop it (the auto CTRL+C) and continue to the next regressor. Is that possible?
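One way this could work, sketched here under assumptions: the fit has to run in a child process to be killable, and `multiprocessing` must pickle whatever it runs (especially with Windows' spawn start method), which makes a decorator wrapping the whole loop fragile. A plain per-call helper avoids that; `run_with_timeout` and `slow` are illustrative names, not library APIs:

```python
import multiprocessing
import time

def run_with_timeout(func, args=(), kwds=None, max_seconds=120):
    # Run `func` in a one-worker process pool; if it doesn't finish within
    # `max_seconds`, kill the worker (the "simulated CTRL+C") and return None.
    with multiprocessing.Pool(processes=1) as pool:
        result = pool.apply_async(func, args, kwds or {})
        try:
            return result.get(timeout=max_seconds)
        except multiprocessing.TimeoutError:
            pool.terminate()   # force-stop the stuck worker
            return None

def slow(seconds):
    # Stand-in for selector.fit(X_train, y_train)
    time.sleep(seconds)
    return seconds

if __name__ == "__main__":
    print(run_with_timeout(slow, (0.5,), max_seconds=5))   # finishes -> 0.5
    print(run_with_timeout(slow, (10,), max_seconds=1))    # killed -> None
```

In the loop above, something like `results = run_with_timeout(selector.fit, (X_train, y_train), max_seconds=120)` would then yield either the fitted selector or `None` on timeout; this assumes the estimator, its arguments, and the fitted result are all picklable, since they have to cross the process boundary.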
(Not really looking to explain WHY I'm doing this, it's just a demo; I'm just looking for a solution to this exact problem as described. Let me know if further clarification is needed and I'll respond within a day.)