Found input variables with inconsistent numbers of samples error

Question

I wrote the following code to compute the score of a machine learning model, but I get the error below. What could be the reason?

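import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

veri = pd.read_csv("deneme2.csv")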

veri = veri.drop(['id'], axis=1)

y = veri[['Rating']]
x = veri.drop(['Rating','Genres'], axis=1)


X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)


DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
ytahmin = DTR.predict(x)
DTR.fit(veri[['Reviews','Size','Installs','Type','Price','Content Rating','Category_c']],veri.Rating)
basari_DTR = DTR.score(X_test,y_test)
#print("DecisionTreeRegressor: success rate of", basari_DTR*100, "percent")
a = np.array([159,19000000.0,10000,0,0.0,0,0]).reshape(1, -1)
predict_DTR = DTR.predict(a)
print(f1_score(y_train, y_test, average='macro'))

Error: Found input variables with inconsistent numbers of samples: [6271, 3089]

Are len(x) and len(y) the same for your input? (y = veri[['Rating']], x = veri.drop(['Rating','Genres'], axis=1)) Please mention at which line you got the error. – Bikiran Das, Aug 27 '19 at 08:51
`ytahmin = DTR.predict(x)` doesn't make sense here: you predict on all of your data, when you should only predict on the training set (X_train) or the validation set (X_test). – akhetos, Aug 27 '19 at 08:55
@BikiranDas Yes, both have the same length (len(x)=9360 and len(y)=9360). I think x = veri.drop(['Rating','Genres'], axis=1) gives the error. – Murat Kılınç, Aug 27 '19 at 09:22

score 1 · Answer 1 · answered Aug 27 '19 at 08:53

f1_score needs the true y values from the test set and the values you predicted on the test set, so the last lines should be:

DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)

y_pred = DTR.predict(X_test)
print(f1_score(y_test, y_pred, average='macro'))

You shouldn't call fit twice, and your predictions must have the same length as the test set. See the basic scikit-learn tutorials for more info.

Szymon Maszke


I tried what you suggested, but now I get this error: continuous is not supported – Murat Kılınç Aug 27 '19 at 09:14

That's because the f1 score is **inappropriate** for regression problems; see my answer – desertnaut Aug 27 '19 at 09:32

desertnaut · Accepted Answer · 2019-08-27T09:52:18.827

There are at least two issues with your code.

The first error you report

print(f1_score(y_train, y_test, average='macro')) 
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]

is due to your y_train and y_test having different lengths, as already pointed out in the other answer.
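As a rough illustration of where those two numbers come from, the sketch below splits placeholder data with the same row count reported in the comments (9360) and the same test_size=0.33. The all-zero arrays and random_state=0 are assumptions made only so the snippet runs on its own; they are not the asker's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the row count mentioned in the comments (9360 rows).
x = np.zeros((9360, 7))
y = np.zeros(9360)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0
)

# The two label arrays come from different parts of the split,
# so they cannot be compared element by element.
print(len(y_train), len(y_test))
```

Any metric called on y_train and y_test therefore receives arrays of different lengths, which is what the message reports.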

But this is not the main issue here, because even if you compare y_test against the test-set predictions y_pred, as suggested, you get a new error:

print(f1_score(y_test, y_pred, average='macro'))
Error: continuous is not supported

This is simply because you are in a regression setting, while the f1 score is a classification metric and, as such, it does not work with continuous predictions.

In other words, the f1 score is inappropriate for your (regression) problem, hence the error.
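One way to see this for yourself is to ask scikit-learn how it classifies a target array. The sketch below uses type_of_target from sklearn.utils.multiclass on made-up values (the arrays are illustrative assumptions, not the asker's data) and then reproduces the error in isolation.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.utils.multiclass import type_of_target

# Made-up values: float ratings (regression) versus 0/1 class labels.
ratings = np.array([4.1, 3.9, 4.5, 4.7])
predicted = np.array([4.0, 4.2, 4.4, 4.6])
labels = np.array([0, 1, 1, 0])

rating_type = type_of_target(ratings)
label_type = type_of_target(labels)
print(rating_type, label_type)

# f1_score rejects continuous targets with a ValueError.
try:
    f1_score(ratings, predicted, average='macro')
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```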

Check the list of metrics available in scikit-learn, where you can confirm that the f1 score is used only in classification, and pick another metric suitable for regression problems.
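A minimal sketch of what that could look like, assuming synthetic data in place of deneme2.csv (the features, the target formula and random_state=0 are all made up for illustration). It fits once on the training split, predicts on the test split, and scores with three regression metrics from sklearn.metrics.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 0.5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Fit once on the training split, then predict on the held-out split.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Regression metrics take (y_true, y_pred) of equal length.
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae, "MSE:", mse, "R^2:", r2)
```

For a regressor, model.score(X_test, y_test) returns the same R² value, so the score call in the question is already a regression metric.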

For a more detailed exposition about what happens when choosing inappropriate metrics in scikit-learn, see "Accuracy Score ValueError: Can't Handle mix of binary and continuous target".
