I wrote the following code to compute a score for a machine learning model, but I get the error shown below. What could be the reason?

veri = pd.read_csv("deneme2.csv")

veri = veri.drop(['id'], axis=1)

y = veri[['Rating']]
x = veri.drop(['Rating','Genres'], axis=1)


X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)


DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
ytahmin = DTR.predict(x)
DTR.fit(veri[['Reviews','Size','Installs','Type','Price','Content Rating','Category_c']],veri.Rating)
basari_DTR = DTR.score(X_test,y_test)
#print("DecisionTreeRegressor: Yüzde",basari_DTR*100," oranında:" )
a = np.array([159,19000000.0,10000,0,0.0,0,0]).reshape(1, -1)
predict_DTR = DTR.predict(a)
print(f1_score(y_train, y_test, average='macro')) 

Error: Found input variables with inconsistent numbers of samples: [6271, 3089]

desertnaut
Murat Kılınç
  • Are len(x) and len(y) the same for your input? y = veri[['Rating']] x = veri.drop(['Rating','Genres'], axis=1). Please mention at which line you got the error. – Bikiran Das Aug 27 '19 at 08:51
  • `ytahmin = DTR.predict(x)` doesn't make sense: you predict on all of your data, when you should predict only on the training set (X_train) or the validation set (X_test). – akhetos Aug 27 '19 at 08:55
  • @BikiranDas Yes, both have the same length (len(x)=9360 and len(y)=9360). I think x = veri.drop(['Rating','Genres'], axis=1) gives the error. – Murat Kılınç Aug 27 '19 at 09:22

2 Answers

f1_score needs the true y values from the test set and the values you predicted on the test set, so the last lines should be:

DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)

y_pred = DTR.predict(X_test)
print(f1_score(y_pred, y_test, average='macro')) 

You shouldn't call fit twice, and your predictions must have the same length as the test labels. See the basic scikit-learn tutorials for more info.

Szymon Maszke
There are at least two issues with your code.

The first error you report

print(f1_score(y_train, y_test, average='macro')) 
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]

is due to your y_train and y_test having different lengths, as already pointed out in the other answer.
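The two numbers in the error message are just the sizes of the training and test splits. A minimal sketch, assuming 9360 rows (the figure quoted in the comments) and placeholder zero arrays instead of the real deneme2.csv data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 9360 rows, as reported in the comments; the values are irrelevant here.
X = np.zeros((9360, 7))
y = np.zeros(9360)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# y_train and y_test come from different rows and have different lengths,
# so they cannot be compared element by element in a metric.
print(len(y_train), len(y_test))
```

A metric expects two arrays of the same length that describe the same rows, typically y_test and the predictions for X_test.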

But this is not the main issue here, because, even if you change y_train to y_pred, as suggested, you get a new error:

print(f1_score(y_pred, y_test, average='macro')) 
Error: continuous is not supported 

This is simply because you are in a regression setting, while the f1 score is a classification metric and, as such, it does not work with continuous predictions.

In other words, f1 score is inappropriate for your (regression) problem, hence the error.
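The behaviour can be reproduced without the original dataset. A small sketch, using made-up rating values rather than the real data, that passes continuous targets to f1_score and catches the resulting exception:

```python
from sklearn.metrics import f1_score

# Made-up continuous values standing in for true and predicted ratings.
y_true = [4.1, 3.9, 4.7, 4.5]
y_hat = [4.0, 4.2, 4.6, 4.3]

raised = False
try:
    f1_score(y_true, y_hat, average='macro')
except ValueError as err:
    # scikit-learn rejects continuous targets for classification metrics.
    raised = True
    print("ValueError:", err)
```

Here the two arrays have the same length, so the failure comes only from the type of the targets, not from their sizes.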

Check the list of metrics available in scikit-learn, where you can confirm that f1 score is used only in classification, and pick another metric suitable for regression problems.
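As an illustration, here is a minimal sketch of evaluating a DecisionTreeRegressor with regression metrics. The data is synthetic (an assumption, since deneme2.csv is not available), and the choice of MAE, MSE and R² is only one reasonable option, not a recommendation specific to this dataset:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real features and ratings.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Fit once, on the training split only.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# Predict on the test split so that y_pred and y_test have the same length.
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("MSE:", mse)
print("R^2:", r2)
```

For a regressor, model.score(X_test, y_test) returns the same R² value as r2_score(y_test, y_pred), so the score call in the question is already a regression metric.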

For a more detailed exposition of what happens when choosing inappropriate metrics in scikit-learn, see "Accuracy Score ValueError: Can't Handle mix of binary and continuous target".

desertnaut