
I have started working on a project wherein I need to detect trainable parameters for a given scikit-learn estimator, and if possible, to find allowable values for categorical variables (and reasonable intervals for continuous ones).

I can fetch a dictionary of parameters using `estimator.get_params()` and then set values using `estimator.set_params(**{'var1': val1, 'var2': val2})`, and so on.
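As a minimal sketch of that round trip (assuming scikit-learn is installed; the KNN classifier below is just the example from this question):

```python
# Fetch the parameter dict, modify a value, and read it back.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
params = knn.get_params()                 # dict of constructor parameters
knn.set_params(**{'n_neighbors': 10})     # same as knn.set_params(n_neighbors=10)
print(knn.get_params()['n_neighbors'])
```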

For example, for a KNN classifier we have the following dict of params: `{'metric': 'minkowski', 'algorithm': 'auto', 'n_neighbors': 10, 'n_jobs': 1, 'p': 2, 'metric_params': None, 'weights': 'uniform', 'leaf_size': 30}`.

Now, I can use the types of the values to infer which parameters are categorical (`str`), continuous (`float`), discrete (`int`), and so on. One possibly related problem is parameters whose default is `None`; I might just leave those untouched anyway, for good reason.
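A sketch of that type-based bucketing, using the KNN defaults from above (the helper name `classify_params` is mine, not anything from sklearn; `None`-valued defaults are simply set aside):

```python
# Bucket parameters by the Python type of their default value.
defaults = {'metric': 'minkowski', 'algorithm': 'auto', 'n_neighbors': 10,
            'n_jobs': 1, 'p': 2, 'metric_params': None,
            'weights': 'uniform', 'leaf_size': 30}

def classify_params(defaults):
    kinds = {'categorical': [], 'discrete': [], 'continuous': [], 'unknown': []}
    for name, value in defaults.items():
        if isinstance(value, (bool, str)):   # bool is a subclass of int: check it first
            kinds['categorical'].append(name)
        elif isinstance(value, int):
            kinds['discrete'].append(name)
        elif isinstance(value, float):
            kinds['continuous'].append(name)
        else:                                # None and anything unexpected: leave alone
            kinds['unknown'].append(name)
    return kinds

kinds = classify_params(defaults)
```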

The challenge now becomes to infer and define a parameter grid for use in e.g. `RandomizedSearchCV`. For discrete and continuous variables the problem is tractable using e.g. a combination of try-except blocks together with the `scipy.stats` module, possibly restricting the interval to lie in the "vicinity" of the default value (while being careful not to set e.g. `n_jobs` to some crazy value; that might need to be hard-coded, or set explicitly later). If you have experience with something similar and have some tips/tricks up your sleeve, I would love to hear about them.
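One possible shape for that idea, assuming scipy is available (the helper names, the guard list `DO_NOT_TUNE`, and the vicinity heuristics are all my own assumptions, and the heuristics assume positive defaults):

```python
# Map numeric defaults to sampling distributions near the default,
# with a hard-coded guard list for parameters that should not be tuned.
from scipy import stats

DO_NOT_TUNE = {'n_jobs', 'verbose', 'random_state'}

def vicinity_distribution(default):
    """Return a scipy.stats distribution sampling near a numeric default."""
    if isinstance(default, bool):
        return None                                   # treat bools as categorical
    if isinstance(default, int):
        low = max(1, default // 2)
        return stats.randint(low, default * 2 + 1)    # integers in [low, 2*default]
    if isinstance(default, float):
        return stats.uniform(loc=default / 2, scale=default)  # [d/2, 3d/2]
    return None

def build_param_distributions(defaults):
    grid = {}
    for name, value in defaults.items():
        if name in DO_NOT_TUNE or value is None:
            continue
        dist = vicinity_distribution(value)
        if dist is not None:
            grid[name] = dist                         # usable in RandomizedSearchCV
    return grid

grid = build_param_distributions(
    {'n_neighbors': 10, 'n_jobs': 1, 'p': 2, 'metric_params': None})
```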

But the real problem now is: how to infer, for e.g. `algorithm`, that the allowable values actually are `{'auto', 'ball_tree', 'kd_tree', 'brute'}`?

I have just started looking into the problem. Perhaps we can parse the error message we get if we try to set an un-allowable value? I am on the lookout for good ideas here, as I want to avoid doing this manually (I will if I have to, but it seems rather inelegant...).
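A rough probe of the error-parsing idea (the helper name and the toy data are mine; note that scikit-learn validates most parameter values only when `fit()` is called, not inside `set_params`, and invalid-parameter errors derive from `ValueError`):

```python
# Deliberately set an invalid value, call fit(), and capture the message.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def probe_invalid_value(estimator, param, bad_value, X, y):
    """Return the error message raised by fitting with a bad value, else None."""
    est = estimator.set_params(**{param: bad_value})
    try:
        est.fit(X, y)
    except ValueError as exc:
        return str(exc)
    return None

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
msg = probe_invalid_value(KNeighborsClassifier(n_neighbors=2),
                          'algorithm', 'not_a_real_algorithm', X, y)
```

The returned message mentions the offending parameter, but how reliably the allowed values can be scraped out of it will depend on the estimator and the sklearn version.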

  • Note to self: This might be a very difficult/unsolvable problem. I've poked around in the api and source code, and looked at how e.g. auto-sklearn solves this. It seems that a manual (hard coded) solution is the way to go for now. – Magnus Jun 30 '17 at 09:49
  • Interesting problem you've got there. Aside from [parsing the signature and default parameters](https://stackoverflow.com/questions/2677185/how-can-i-read-a-functions-signature-including-default-argument-values) I guess I would try parsing scikit-learn's docstring like [this](https://stackoverflow.com/questions/713138/getting-the-docstring-from-a-function). Another thing to try would be to parse the stringified function, e.g. `__init__` of the estimator but that is a - messy- long shot since I don't see any checks being done there&there is a whole hierarchy you might have to look at. – mkaran Jun 30 '17 at 10:20
  • Hello! Glad you find the theme interesting. Yes, that was/is one of the options I considered/am considering (parsing the doc). But what worries me, is consistency in the way the docstrings are written, and there are no enforced conventions (but I may be wrong though) that may be taken advantage of. I might just spend a little bit of time implementing a parser and test it on a bunch of docstrings... – Magnus Jun 30 '17 at 10:44
  • Yes, after looking at some of the docstrings I realize that unfortunately it won't be an easy task. There is some consistency but not enough to make this easy. Good luck! Let us know how this works out! – mkaran Jun 30 '17 at 10:50
  • Thanks, I'll keep this thread open and report back any progress. Have a good weekend! – Magnus Jun 30 '17 at 10:54
  • Thanks! You too! – mkaran Jun 30 '17 at 12:01

2 Answers


I found a solution for the particular example I was looking at; however, it doesn't generalize well to other docstrings, as there is no set convention for how they are written across sklearn estimators.

Therefore, I post my "solution" so that others can take over and possibly improve on it. See the following snippet:

import re
from pprint import pprint
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
doc = knn.__doc__  # Get the docstring
#from sklearn.svm import SVC
#svc = SVC()
#doc = svc.__doc__
# Match either a "name : " parameter header or a "- 'value'" bullet entry
pattern = re.compile(r"([a-zA-Z_]+\s:\s)|(-\s*)'([a-zA-Z_]+)'")
matches = pattern.findall(doc)

clf_params = {}
previous_param = ''
for param, _, value in matches:
    if ":" in param and param[-4] != "_":  # 'Hack-y': skips fitted attributes, which end in '_'
        if param not in clf_params:
            clf_params[param] = list()
            previous_param = param
    elif value and previous_param in clf_params:
        # A quoted "- 'value'" entry belongs to the last parameter header seen
        clf_params[previous_param].append(value)
pprint(clf_params)

This snippet prints

{'algorithm : ': ['ball_tree', 'kd_tree', 'brute', 'auto'],
 'leaf_size : ': [],
 'metric : ': [],
 'metric_params : ': [],
 'n_jobs : ': [],
 'n_neighbors : ': [],
 'p : ': [],
 'weights : ': ['uniform', 'distance']}

Which is correct.

However, if we repeat the same procedure for `SVC().__doc__`, we will see that it fails.

I hope somebody finds this somewhat useful.

  • This is really hack-y, and it's sad that I find this nearly three years later as barren as it was in '17. Bummer. – Vaidøtas I. Mar 11 '20 at 21:53
  • Ok, here's my attempt to get all this from the docstring: str(Algorithm().__doc__).split('Parameters\n ----------\n')[1].split('\n\n Attributes\n')[0].replace('\n ', '\n') This does not create a dictionary, but is simple enough to extract just the explained "Parameters" section from the docstring, which has all of the params explained and have all their listed possible/expected/accepted value inputs, which are nicely indented by one, tab, and now all that is left is to get just the indented rows from this string, which I am sure we can manage. – Vaidøtas I. Mar 11 '20 at 22:48

My attempt to get all this from the docstring (LinearSVC as the example algorithm), which was aided greatly by splitlines():

from sklearn.svm import LinearSVC

liner = str(LinearSVC().__doc__).split('Parameters\n    ----------\n')[1].split('\n\n    Attributes\n')[0].replace('\n        ', '\n').splitlines()

This does not create a dictionary, but it is simple enough to extract just the "Parameters" section from the docstring, which explains all of the params and lists their possible/expected/accepted values, nicely indented by one tab. Now we can use a simple loop with a conditional, using `" : "` as our anchor to identify the parameter lines:

for i in liner:
    if " : " in i:  # <<< the key is to use " : " as our anchor
        print(i)

The end result, prints out to:

    penalty : str, 'l1' or 'l2' (default='l2')
    loss : str, 'hinge' or 'squared_hinge' (default='squared_hinge')
    dual : bool, (default=True)
    tol : float, optional (default=1e-4)
    C : float, optional (default=1.0)
    multi_class : str, 'ovr' or 'crammer_singer' (default='ovr')
    fit_intercept : bool, optional (default=True)
    intercept_scaling : float, optional (default=1)
    class_weight : {dict, 'balanced'}, optional
    verbose : int, (default=0)
    random_state : int, RandomState instance or None, optional (default=None)
    max_iter : int, (default=1000)

So glad I can share, and if anyone else needs the full docstring parameter printout, just use:

print(str(LinearSVC().__doc__).split('Parameters\n    ----------\n')[1].split('\n\n    Attributes\n')[0].replace('\n        ', '\n'))

EDIT: If this is not intended to be printed out, the best way to have it as a string object is a list comprehension, but it requires some ugly replaces, because there is extensive notation in the docstring:

docstring_short = str([i for i in liner if " : " in i]).replace('["    ', '').replace('    ', ',\n').replace('", "', '').replace('", \'', '').replace("', '", '').replace("', \"", '').replace(']', '')
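As an alternative sketch to the replace chain above, each `" : "` line can be split into a (name, options) pair with a small regex. The helper name is mine, and the line format is assumed to look like the printout above (note the de-duplication, since `default='l2'` quotes a value a second time):

```python
# Split a "name : description" docstring line into the parameter name
# and the de-duplicated list of quoted alternatives in the description.
import re

def options_from_line(line):
    name, _, description = line.partition(' : ')
    quoted = re.findall(r"'([A-Za-z0-9_]+)'", description)
    return name.strip(), list(dict.fromkeys(quoted))  # keep order, drop repeats

pair = options_from_line("penalty : str, 'l1' or 'l2' (default='l2')")
```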