
Implementing linear regression as below:

from sklearn.linear_model import LinearRegression

x = [1,2,3,4,5,6,7]
y = [1,2,1,3,2.5,2,5]

# Create linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit([x], [y])

# print(x)
regr.predict([[1, 2000, 3, 4, 5, 26, 7]])

produces:

array([[1. , 2. , 1. , 3. , 2.5, 2. , 5. ]])

When using the predict function, why can't I pass a single x value to make a prediction?

Trying regr.predict([[2000]])

returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-3a8b477f5103> in <module>()
     11 
     12 # print(x)
---> 13 regr.predict([[2000]])

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/base.py in predict(self, X)
    254             Returns predicted values.
    255         """
--> 256         return self._decision_function(X)
    257 
    258     _preprocess_data = staticmethod(_preprocess_data)

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/base.py in _decision_function(self, X)
    239         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    240         return safe_sparse_dot(X, self.coef_.T,
--> 241                                dense_output=True) + self.intercept_
    242 
    243     def predict(self, X):

/usr/local/lib/python3.6/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 

ValueError: shapes (1,1) and (7,7) not aligned: 1 (dim 1) != 7 (dim 0)
blue-sky
  • So it seems like the function makes a 7D prediction! The model thinks you passed one sample X, which is 7D, and it produces an output y that is also 7D. Thus your new input doesn't fit. Maybe you should ravel your input and output to (7,1)-dim vectors. – Quickbeam2k1 Apr 29 '18 at 20:06
  • Adding to @Quickbeam2k1 's comment. Use [reshape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html). For example, `X = np.reshape(x, (7,1))` Then fit your model and it should work as expected. – W Stokvis Apr 29 '18 at 22:14
  • @Quickbeam2k1 It's 2D (not 7D). It's just that the second dimension has 7 elements. – Vivek Kumar May 01 '18 at 11:01
  • So each vector you pass is 7D. Also, your answer below uses 7D vectors for x and y. – Quickbeam2k1 May 01 '18 at 16:35
  • @Quickbeam2k1 7 elements don't mean 7D. – Vivek Kumar May 02 '18 at 09:08
  • Sorry, maybe this is due to my mathematical background: a 2D matrix with m rows and n columns consists of n (m,1) (m-D) vectors or of m (1,n) (n-D) vectors. So 7 elements does not mean 7D, but it can be interpreted like that (in this case here). By the way, if one didn't interpret it as such structures, the curse of dimensionality wouldn't play such an important role, since most problems could then just be cast into 2D matrices or 3D tensors. – Quickbeam2k1 May 02 '18 at 10:51

1 Answer

When you do this:

regr.fit([x], [y])

You are essentially passing this:

regr.fit([[1,2,3,4,5,6,7]], [[1,2,1,3,2.5,2,5]])

that has a shape of (1,7) for X and (1,7) for y.
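
A quick way to confirm these shapes is to build the same arrays with NumPy, which is roughly the conversion scikit-learn applies to list inputs. This is only a minimal sketch; the names X_wrapped and y_wrapped are illustrative and do not appear in the question.

```python
import numpy as np

x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 1, 3, 2.5, 2, 5]

# Wrapping a flat list in [] adds a leading axis of length 1,
# so the whole list becomes a single row.
X_wrapped = np.array([x])
y_wrapped = np.array([y])

print(X_wrapped.shape)  # (1, 7)
print(y_wrapped.shape)  # (1, 7)
```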

Now looking at the documentation of fit():

Parameters:

X : numpy array or sparse matrix of shape [n_samples,n_features]
    Training data

y : numpy array of shape [n_samples, n_targets]
    Target values. Will be cast to X’s dtype if necessary

So here the model assumes that your data is a single sample with 7 features and 7 targets, which is a multi-output regression problem.

So at prediction time, the model requires data with 7 features, i.e. something of shape (n_samples_to_predict, 7), and it outputs data of shape (n_samples_to_predict, 7).
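
The traceback in the question shows that prediction boils down to np.dot(X, coef_.T) + intercept_. The sketch below mimics that step with plain NumPy. The arrays coef and intercept are placeholders with the shapes a 7-feature, 7-target fit would have (assumed here to be (7, 7) and (7,)), not the values the model actually learned, and linear_predict is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

n_features, n_targets = 7, 7

# Placeholder parameters: only their shapes matter for this illustration.
coef = np.zeros((n_targets, n_features))
intercept = np.zeros(n_targets)

def linear_predict(X):
    """Mimic the matrix product from the traceback: X @ coef.T + intercept."""
    X = np.asarray(X, dtype=float)
    return np.dot(X, coef.T) + intercept

# Three samples with 7 features each give three rows of 7 targets.
print(linear_predict(np.ones((3, 7))).shape)  # (3, 7)

# One sample with a single feature cannot be multiplied by a (7, 7) matrix.
try:
    linear_predict([[2000]])
except ValueError as err:
    print("ValueError:", err)
```

The second call fails for the same reason as in the question: the inner dimensions of the product, 1 and 7, do not match.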

If instead, you wanted to have something like this:

  x   y
  1  1.0
  2  2.0
  3  1.0
  4  3.0
  5  2.5
  6  2.0
  7  5.0

then you need to have a shape of (7,1) for input x and (7,) or (7,1) for target y.

So as @WStokvis said in comments, you need to do this:

import numpy as np
X = np.array(x).reshape(-1, 1)
y = np.array(y)          # You may omit this step if you want

regr.fit(X, y)           # Don't wrap it in []

And then again at prediction time:

X_new = np.array([1, 2000, 3, 4, 5, 26, 7]).reshape(-1, 1)
regr.predict(X_new)

After that, the following will not raise an error:

regr.predict([[2000]])

because the required shape is present.
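
Putting the pieces together, here is a minimal end-to-end sketch of the corrected workflow using the data from the question. It assumes scikit-learn and NumPy are installed, and it only checks shapes rather than claiming particular predicted values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 1, 3, 2.5, 2, 5]

# 7 samples with 1 feature each, and 7 scalar targets.
X = np.array(x).reshape(-1, 1)
regr = LinearRegression()
regr.fit(X, y)

print(X.shape)           # (7, 1)
print(regr.coef_.shape)  # (1,)

# One new sample with one feature gives one prediction.
pred = regr.predict([[2000]])
print(pred.shape)        # (1,)
```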

Update for the comment:

When you pass [[2000]], it is internally converted to np.array([[2000]]), so it has the shape (1,1). This matches (n_samples, n_features) with n_features = 1, which is what the model expects because the training data had shape (n_samples, 1). So this works.

Now let's say you have:

X_new = [1, 2000, 3, 4, 5, 26, 7]  # Not yet wrapped in a numpy array and reshaped with reshape(-1, 1)

Again, it will be internally transformed as this:

X_new = np.array([1, 2000, 3, 4, 5, 26, 7])

So now X_new has a shape of (7,). Note that it is only a one-dimensional array. It is neither a row vector nor a column vector, just a one-dimensional array of shape (n,).

So scikit-learn cannot infer whether it means n_samples=n and n_features=1 or the other way around (n_samples=1 and n_features=n). Please see my other answer, which explains this.

So we need to explicitly convert the one-dimensional array to 2-D with reshape(-1,1). I hope it's clear now.
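
The two possible readings of a one-dimensional array can be made explicit with reshape. In this small sketch the names as_samples and as_features are illustrative only.

```python
import numpy as np

X_new = np.array([1, 2000, 3, 4, 5, 26, 7])
print(X_new.shape)  # (7,)

# Reading 1: 7 samples with 1 feature each (what the corrected model expects).
as_samples = X_new.reshape(-1, 1)

# Reading 2: 1 sample with 7 features.
as_features = X_new.reshape(1, -1)

print(as_samples.shape)   # (7, 1)
print(as_features.shape)  # (1, 7)
```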

Vivek Kumar
  • Thanks. Why do we need the code `X_new = np.array([1, 2000, 3, 4, 5, 26, 7]).reshape(-1, 1)` followed by `regr.predict(X_new)`, when `regr.predict([[2000]])` appears to work without it? – blue-sky May 01 '18 at 19:36
  • @blue-sky I have updated the answer for your comment. Please take a look and ask if still not clear. – Vivek Kumar May 02 '18 at 05:27
  • @VivekKumar Thanks Vivek for explaining it so beautifully and in detail. I was struggling a lot to look for a way to fix the same type of issue I was facing. – Omi Jul 13 '19 at 17:29