sklearn issue: Found arrays with inconsistent numbers of samples when doing regression

Question

This question seems to have been asked before, but I can't comment on the accepted answer to ask for clarification, and I couldn't work out the solution it provides.

I am trying to learn how to use sklearn with my own data. I essentially just got the annual % change in GDP for 2 different countries over the past 100 years. I am just trying to learn using a single variable for now. What I am essentially trying to do is use sklearn to predict what the GDP % change for country A will be given the percentage change in country B's GDP.

The problem is that I receive an error saying:

ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]

Here is my code:

import sklearn.linear_model as lm
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import matplotlib.dates as mdates


def bytespdate2num(fmt, encoding='utf-8'):#function to convert bytes to string for the dates.
    strconverter = mdates.strpdate2num(fmt)
    def bytesconverter(b):
        s = b.decode(encoding)
        return strconverter(s)
    return bytesconverter

dataCSV = open('combined_data.csv')

comb_data = []

for line in dataCSV:
    comb_data.append(line)

date, chngdpchange, ausgdpchange = np.loadtxt(comb_data, delimiter=',', unpack=True, converters={0: bytespdate2num('%d/%m/%Y')})


chntrain = chngdpchange[:-1]
chntest = chngdpchange[-1:]

austrain = ausgdpchange[:-1]
austest = ausgdpchange[-1:]

regr = lm.LinearRegression()
regr.fit(chntrain, austrain)

print('Coefficients: \n', regr.coef_)

print("Mean squared error: %.2f"
      % np.mean((regr.predict(chntest) - austest) ** 2))

print('Variance score: %.2f' % regr.score(chntest, austest))

plt.scatter(chntest, austest,  color='black')
plt.plot(chntest, regr.predict(chntest), color='blue')

plt.xticks(())
plt.yticks(())

plt.show()

What am I doing wrong? I tried to apply the sklearn tutorial (which uses a diabetes data set) to my own simple data. My data contains only the date, country A's % change in GDP for that year, and country B's % change in GDP for the same year.

I tried the solutions here and here (basically trying to find more out about the solution in the first link), but just receive the exact same error.

Here is the full traceback in case you want to see it:

Traceback (most recent call last):
  File "D:\My Stuff\Dropbox\Python\Python projects\test regression\tester.py", line 34, in <module>
    regr.fit(chntrain, austrain)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\linear_model\base.py", line 376, in fit
    y_numeric=True, multi_output=True)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 454, in check_X_y
    check_consistent_length(X, y)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 174, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [  1 107]

Check the shapes of `chntrain` and `austrain` before splitting into training/test sets. They should have the same shape; the error indicates that the sizes differ. - Ryan, Aug 19 '15 at 13:57
How can I do that? I've been googling, but every solution that lets me reshape or find the shape just gives the error: IndexError: too many indices for array - pyman, Aug 19 '15 at 14:29
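One way to inspect the arrays is through the .shape and .ndim attributes that every NumPy array has. The sketch below uses random placeholder values in place of the question's GDP data, which is not available here. It also shows that an "IndexError: too many indices" typically appears when a 1-D array is indexed as if it were 2-D.

```python
import numpy as np

# Placeholder stand-ins for the question's arrays (random values, not real GDP data).
rng = np.random.default_rng(0)
chntrain = rng.normal(size=107)
austrain = rng.normal(size=107)

print(chntrain.shape, chntrain.ndim)  # (107,) 1
print(austrain.shape, austrain.ndim)  # (107,) 1

# Indexing a 1-D array with two indices raises IndexError.
try:
    chntrain[:, 0]
    raised = False
except IndexError:
    raised = True
print("IndexError raised:", raised)
```

Both arrays have the same length here, so the mismatch reported in the traceback comes from how the 1-D X is interpreted, not from missing rows.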

score 6 · Answer 1 · answered Jul 01 '16 at 13:53

In fit(X, y), the input parameter X is supposed to be a 2-D array. If X in your data is only one-dimensional, you can reshape it into a 2-D array like this:

regr.fit(chntrain_X.reshape(len(chntrain_X), 1), chntrain_Y)
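A minimal sketch of this fix, using the question's variable names chntrain and austrain but synthetic placeholder data with a known linear relation (an assumption made only so the result can be checked):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data: 107 values with a known linear relation.
rng = np.random.default_rng(0)
chntrain = rng.normal(size=107)          # shape (107,), 1-D feature values
austrain = 2.0 * chntrain + 1.0          # shape (107,), 1-D target

# Reshape X to 2-D: 107 samples (rows), 1 feature (column).
X = chntrain.reshape(len(chntrain), 1)

regr = LinearRegression()
regr.fit(X, austrain)

print(X.shape)                           # (107, 1)
print(regr.coef_, regr.intercept_)
```

The target y can stay 1-D; only X needs the explicit (n_samples, n_features) layout.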

Chang Men

score 0 · Answer 2 · answered Aug 19 '15 at 21:26

regr.fit(chntrain, austrain)

This doesn't look right. The first parameter to fit should be an X, which refers to a feature vector. The second parameter should be a y, which is the correct answers (targets) vector associated with X.

For example, if you have GDP, you might have:

X[0] = [43, 23, 52] -> y[0] = 5
# meaning the first year had the features [43, 23, 52] (I just made them up)
# and the change that year was 5

Judging by your names, both chntrain and austrain are feature vectors. Judging by how you load your data, maybe the last column is the target?

Maybe you need to do something like:

chntrain_X, chntrain_y = chntrain[:, :-1], chntrain[:, -1]
# you can do the same with austrain and concatenate them or test on them if this part works
regr.fit(chntrain_X, chntrain_y)

But we can't tell without knowing the exact storage format of your data.
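The slicing above only works on a 2-D table with one row per sample. Here is a small sketch with made-up numbers (the columns feature_1, feature_2 and target are hypothetical). In the question, np.loadtxt is called with unpack=True, so each column is already a separate 1-D array, and chntrain[:, :-1] would raise an IndexError there.

```python
import numpy as np

# Made-up table: one row per year, columns = [feature_1, feature_2, target].
data = np.array([
    [43.0, 23.0, 5.0],
    [40.0, 25.0, 4.0],
    [47.0, 21.0, 6.0],
])

X = data[:, :-1]   # every column except the last: the feature matrix
y = data[:, -1]    # the last column: the target vector

print(X.shape, y.shape)  # (3, 2) (3,)
```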

score 0 · Answer 3 · answered Oct 22 '15 at 10:34

Try changing chntrain to a 2-D array instead of 1-D, i.e. reshape to (len(chntrain), 1).

For prediction, also change chntest to a 2-D array.
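As a rough sketch of both steps together, the following mirrors the question's train/test split but reshapes the feature arrays with reshape(-1, 1), which lets NumPy infer the number of rows. The data is synthetic and the linear relation is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder series standing in for the two GDP columns.
rng = np.random.default_rng(1)
chngdpchange = rng.normal(size=108)
ausgdpchange = 0.5 * chngdpchange - 0.2

# Hold out the last observation, as in the question, and make X 2-D.
chntrain = chngdpchange[:-1].reshape(-1, 1)   # shape (107, 1)
chntest = chngdpchange[-1:].reshape(-1, 1)    # shape (1, 1)
austrain, austest = ausgdpchange[:-1], ausgdpchange[-1:]

regr = LinearRegression().fit(chntrain, austrain)
pred = regr.predict(chntest)                  # shape (1,)
mse = np.mean((pred - austest) ** 2)

print(pred.shape, mse)
```

With a single held-out row, regr.score (R²) is not well defined, so holding out more rows is probably a better choice when evaluating the model.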

edited Oct 22 '15 at 10:48

answered Oct 22 '15 at 10:34

qg_jinn

score 0 · Answer 4 · answered Dec 15 '16 at 11:23

I have been having similar problems to you and have found a solution.

Where you have the following error:

ValueError: Found arrays with inconsistent numbers of samples: [  1 107]

The [ 1 107] part is basically saying that your array is the wrong way around. Sklearn thinks you have 107 columns of data with 1 row.

To fix this try transposing the X data like so:

chntrain.T

Then re-run your fit:

regr.fit(chntrain, austrain)

Depending on what your "austrain" data looks like you may need to transpose this too.
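One caveat worth checking before relying on a transpose: for a 1-D NumPy array, .T returns the array with its shape unchanged, and .T never modifies an array in place. The sketch below, on placeholder data, shows that a reshape is what actually produces a column, while .T only swaps axes once the array is 2-D.

```python
import numpy as np

chntrain = np.arange(107, dtype=float)   # placeholder 1-D data

print(chntrain.T.shape)                  # (107,): unchanged for a 1-D array

column = chntrain.reshape(-1, 1)
print(column.shape)                      # (107, 1)

row = chntrain.reshape(1, -1)            # explicit 2-D row of shape (1, 107)
print(row.T.shape)                       # (107, 1): .T swaps axes on a 2-D array
```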

score 0 · Answer 5 · answered Dec 17 '16 at 05:38

You can also use np.newaxis, for example X = X[:, np.newaxis]. I found this method in the linked "Logistic function" example.
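A short sketch, on placeholder values, showing that indexing with np.newaxis adds a trailing axis of length 1 and gives the same result as reshape(-1, 1):

```python
import numpy as np

X = np.array([1.5, -0.3, 2.1, 0.7])   # placeholder values, shape (4,)

X_col = X[:, np.newaxis]              # add a second axis of length 1
print(X_col.shape)                    # (4, 1)

# Same result as an explicit reshape into one column.
same = np.array_equal(X_col, X.reshape(-1, 1))
print(same)                           # True
```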

Cloud Cho
