I have started working on a project wherein I need to detect trainable parameters for a given scikit-learn estimator, and if possible, to find allowable values for categorical variables (and reasonable intervals for continuous ones).
I can fetch a dictionary of parameters using `estimator.get_params()`, and then set values using `estimator.set_params(**{'var1': val1, 'var2': val2})`, and so on.
For example, for a KNN classifier we have the following dict of params: `{'metric': 'minkowski', 'algorithm': 'auto', 'n_neighbors': 10, 'n_jobs': 1, 'p': 2, 'metric_params': None, 'weights': 'uniform', 'leaf_size': 30}`.
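For reference, this is roughly how I am fetching and setting them (a minimal sketch; the dict above is what `get_params()` returned on my scikit-learn version):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=10)

# Fetch the current hyperparameters as a plain dict.
params = knn.get_params()
print(params)

# Set several parameters at once; set_params returns the estimator itself.
knn.set_params(**{'n_neighbors': 5, 'weights': 'distance'})
```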
Now, I can use the types of the values to infer which are categorical (`str` types), continuous (`float`), discrete (`int`) and so on. One possibly related problem is parameters whose default is `None`, but I might just not touch those anyway, for a good reason.
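To make that concrete, this is the kind of type-based bucketing I have in mind (`classify_params` is my own hypothetical helper, not part of scikit-learn):

```python
def classify_params(estimator):
    """Bucket hyperparameters by the Python type of their current value."""
    buckets = {'categorical': [], 'continuous': [], 'discrete': [], 'unknown': []}
    for name, value in estimator.get_params().items():
        if isinstance(value, bool):
            # bool is a subclass of int, so check it first; treat as categorical.
            buckets['categorical'].append(name)
        elif isinstance(value, str):
            buckets['categorical'].append(name)
        elif isinstance(value, float):
            buckets['continuous'].append(name)
        elif isinstance(value, int):
            buckets['discrete'].append(name)
        else:
            # None defaults, callables, dicts, ... -- the problematic cases.
            buckets['unknown'].append(name)
    return buckets
```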
The challenge now becomes to infer and define a parameter grid for use in e.g. `RandomizedSearchCV`. For discrete and continuous variables the problem is tractable using e.g. a combination of `try`-`except` blocks together with the `scipy.stats` module, possibly restricting the interval to lie in the "vicinity" of the default value (but at the same time being careful not to set e.g. `n_jobs` to some crazy value -- that might need to be hard-coded in, or explicitly set later), as in the sketch below. If you have experience with something similar, and have some tips/tricks up your sleeve, I would love to hear about them.
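Here is a rough sketch of that idea (`guess_distributions`, the `DONT_TOUCH` set, and the half/double interval are all my own arbitrary assumptions):

```python
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

DONT_TOUCH = {'n_jobs'}  # parameters to hard-code or set explicitly later

def guess_distributions(estimator):
    """Hypothetical helper: propose a distribution in the vicinity of each
    numeric default. The interval choices are arbitrary."""
    dists = {}
    for name, default in estimator.get_params().items():
        if name in DONT_TOUCH or isinstance(default, bool):
            continue
        if isinstance(default, int):
            # discrete uniform on [default // 2, 2 * default]
            dists[name] = stats.randint(max(1, default // 2), 2 * default + 1)
        elif isinstance(default, float) and default > 0:
            # continuous uniform on [default / 2, 3 * default / 2]
            dists[name] = stats.uniform(loc=default / 2, scale=default)
    return dists

knn = KNeighborsClassifier()
search = RandomizedSearchCV(knn, guess_distributions(knn), n_iter=20)
```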
But the real problem now is: how to infer, for e.g. `algorithm`, that the allowable values actually are `{'auto', 'ball_tree', 'kd_tree', 'brute'}`?
I have just started looking into the problem, and perhaps we can parse the error message we get if we try to set the parameter to some disallowed value? I am on the lookout for good ideas here, as I want to avoid having to do this manually (I will if I have to, but it seems rather inelegant...)
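Something along these lines is what I am imagining (`probe_allowed_values` is my own hypothetical helper; the wording of the error message differs between scikit-learn versions, so the regex is a fragile assumption):

```python
import re
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny dummy dataset, since validation typically happens at fit time.
X, y = np.random.rand(20, 3), np.random.randint(0, 2, 20)

def probe_allowed_values(estimator, param, bad_value='__definitely_invalid__'):
    """Hypothetical probe: set an obviously invalid value, trigger validation
    by fitting, and scrape any {...}-style set out of the error message."""
    try:
        estimator.set_params(**{param: bad_value}).fit(X, y)
    except Exception as exc:
        match = re.search(r'\{[^}]*\}', str(exc))
        if match:
            return match.group(0)
    return None

print(probe_allowed_values(KNeighborsClassifier(), 'algorithm'))
```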