
I have a data set with float-valued feature columns (X_train) and a continuous target (y_train).

I want to run KNN regression on the data set, and I want to (1) do a grid search for hyperparameter tuning and (2) run cross validation on the training.

I wrote this code:

from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)

cv_method = RepeatedStratifiedKFold(n_splits=5, 
                                    n_repeats=3, 
                                    random_state=999)


# Define our candidate hyperparameters
hp_candidates = [{'n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14,15], 'weights': ['uniform','distance'],'p':[1,2,5]}]

# Search for best hyperparameters
grid = GridSearchCV(estimator=KNeighborsRegressor(), 
                      param_grid=hp_candidates, 
                      cv=cv_method,
                      verbose=1,  
                      scoring='accuracy', 
                      return_train_score=True)

grid.fit(X_train,y_train)

The error I get is:

Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

I understand the error to mean that this approach only works for KNN classification, not regression.

But what I can't find is how to edit this code to make it suitable for KNN regression? Can someone explain to me how this could be done?

(The ultimate aim: I have a data set, and I want to tune the parameters, run cross validation, output the best model based on that, and get back some accuracy scores. Ideally the scores would not be specific to KNN, so that I can compare them with scores from other algorithms.)

Also, just to mention, this is my first attempt at KNN in scikit-learn, so all comments and criticism are welcome.

Slowat_Kela
  • Can you share a part of your data, e.g. five arbitrary samples? Which line causes that error? Also, you are using `accuracy` as a metric for a regression task, which is not appropriate; please see this [answer](https://stackoverflow.com/a/54458777/9332187) – Mustafa Aydın Feb 28 '21 at 15:40
  • Is your problem a classification, or a regression? – Ben Reiniger Feb 28 '21 at 17:01
  • It's regression (the y_train/label is continuous). Mustafa, I can post some lines, but there are over 150 columns per row, so I'm not sure it's appropriate given the space (?); each row is ~150 float values (features), plus a y label that is also a float. – Slowat_Kela Feb 28 '21 at 17:51
  • Then, as @MustafaAydın says, you cannot use `accuracy` as your metric. – Ben Reiniger Feb 28 '21 at 18:37

1 Answer

Yes, you can use GridSearchCV with KNeighborsRegressor.

Since this is a metric choice problem, you can read the metrics documentation here: https://scikit-learn.org/stable/modules/model_evaluation.html

The metrics appropriate for regression problems are different from those for classification problems. Here is the list of appropriate regression metrics:

‘explained_variance’
‘max_error’
‘neg_mean_absolute_error’
‘neg_mean_squared_error’
‘neg_root_mean_squared_error’
‘neg_mean_squared_log_error’
‘neg_median_absolute_error’
‘r2’
‘neg_mean_poisson_deviance’
‘neg_mean_gamma_deviance’
‘neg_mean_absolute_percentage_error’

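So you can choose one to replace "accuracy" and test it. Note that the error message itself is raised by RepeatedStratifiedKFold, which needs class labels to stratify on; for a continuous target, use a non-stratified splitter such as RepeatedKFold or KFold.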
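
As a minimal sketch of how the question's code might be adapted, here is a version that runs end to end. It assumes synthetic data from make_regression in place of scaled_df and target (which are not shown in the question), uses RepeatedKFold because stratified splitters expect class labels, and picks 'neg_root_mean_squared_error' as the scoring metric. That metric is just one option from the list above, and the variable names best_model, test_rmse and test_r2 are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for scaled_df / target (an assumption, not the asker's data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Non-stratified repeated K-fold: works with a continuous target.
cv_method = RepeatedKFold(n_splits=5, n_repeats=3, random_state=999)

# Same candidate grid as in the question.
hp_candidates = {
    "n_neighbors": list(range(2, 16)),
    "weights": ["uniform", "distance"],
    "p": [1, 2, 5],
}

# Regression scorer; scikit-learn negates errors so that higher is better.
grid = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid=hp_candidates,
    cv=cv_method,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is refitted on all of X_train.
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# RMSE and R^2 on the held-out test set; neither is specific to KNN.
test_rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
test_r2 = r2_score(y_test, y_pred)

print("Best parameters:", grid.best_params_)
print("Cross-validated RMSE:", -grid.best_score_)
print("Test RMSE:", test_rmse)
print("Test R^2:", test_r2)
```

Because RMSE and R^2 are computed from predictions alone, the same two numbers can be reported for any other regressor fitted on the same split, which makes them usable for comparing algorithms.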

Malo