Trouble fitting a polynomial regression curve in sklearn

Question

I am new to sklearn and I have an appropriately simple task: given a scatter plot of 15 dots, I need to

Take 11 of them as my 'training sample',
Fit a polynomial curve of degree 3 through these 11 dots;
Plot the resulting polynomial curve over the 15 dots.

But I got stuck at the second step.

This is the data plot:

%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0) 
n = 15 
x = np.linspace(0,10,n) + np.random.randn(n)/5 
y = np.sin(x)+x/6 + np.random.randn(n)/10

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

plt.figure()
plt.scatter(X_train, y_train, label='training data')
plt.scatter(X_test, y_test, label='test data')
plt.legend(loc=4);

I then take the 11 points in X_train and transform them with PolynomialFeatures of degree 3, as follows:

degree = 3
poly = PolynomialFeatures(degree=degree)

X_train_poly = poly.fit_transform(X_train)

Then I try to fit a line through the transformed points (note: X_train_poly.size = 364).

linreg = LinearRegression().fit(X_train_poly, y_train)

and I get the following error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 11]

I have read various questions that address similar and often more complex problems (e.g. Multivariate (polynomial) best fit curve in python?), but I could not extract a solution from them.

possible duplicate: https://stackoverflow.com/questions/32097392/sklearn-issue-found-arrays-with-inconsistent-numbers-of-samples-when-doing-regr — Moritz, Jun 13 '17 at 14:25

score 3 · Accepted Answer · answered Jun 14 '17 at 04:57

The issue is the dimensions of X_train and y_train. Each is a one-dimensional array, so sklearn treats the 11 X values as 11 separate variables of a single record rather than as 11 records of one variable.

Using the .reshape method as follows should do the trick:

# reshape data to have 11 records rather than 11 columns
X_trainT     = X_train.reshape(11,1)
y_trainT     = y_train.reshape(11,1)

# create polynomial features on the single variable
poly         = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_trainT)

print(X_train_poly.shape)
# (11, 4)

linreg       = LinearRegression().fit(X_train_poly, y_trainT)
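To cover the third step of the question (drawing the curve), here is a minimal sketch that reuses the question's data and the fix above. It assumes that reshape(-1, 1) is an acceptable stand-in for the hard-coded reshape(11, 1), and it leaves y_train one-dimensional, which LinearRegression also accepts. The names X_train_2d, x_grid and y_curve are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Same data as in the question.
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# -1 lets NumPy infer the number of rows, so 11 is not hard-coded.
X_train_2d = X_train.reshape(-1, 1)

poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train_2d)
linreg = LinearRegression().fit(X_train_poly, y_train)

# Evaluate the fitted cubic on a dense grid (hypothetical names).
# Use transform, not fit_transform, so the already fitted expansion is reused.
x_grid = np.linspace(0, 10, 100).reshape(-1, 1)
y_curve = linreg.predict(poly.transform(x_grid))

print(X_train_poly.shape, y_curve.shape)
```

With the two scatter calls from the question already drawn, plt.plot(x_grid.ravel(), y_curve) should then overlay the cubic on the 15 dots.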

score 0 · Answer 2 · answered Jun 13 '17 at 17:35

The error basically means that your X_train_poly and y_train don't match: X_train_poly contains only 1 sample of x, while y_train has 11 values. I'm not quite sure what you want, but I guess the polynomial features were not generated the way you intended. What your code currently does is generate the degree-3 polynomial features for a single 11-dimensional point.
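The 364 from the question can be checked directly: the number of monomials of total degree at most 3 in 11 variables is C(11 + 3, 3). The sketch below uses placeholder values (np.arange, not the question's data) and reshapes explicitly, because recent scikit-learn releases, as far as I know, reject 1-D input with an "Expected 2D array" error instead of treating it as one sample.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with 11 features: how the 1-D array of 11 values was interpreted.
one_wide_sample = np.arange(11, dtype=float).reshape(1, -1)

# Eleven samples with 1 feature each: the intended layout.
eleven_samples = np.arange(11, dtype=float).reshape(-1, 1)

wide_shape = PolynomialFeatures(degree=3).fit_transform(one_wide_sample).shape
tall_shape = PolynomialFeatures(degree=3).fit_transform(eleven_samples).shape

# Monomials of total degree <= 3 in 11 variables, constant term included.
n_terms = comb(11 + 3, 3)

print(wide_shape, tall_shape, n_terms)
# (1, 364) (11, 4) 364
```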

I think you want to generate the degree-3 polynomial features for each of your 11 points (that is, for every x). You can use a loop or a list comprehension to do that:

X_train_poly = poly.fit_transform([[i] for i in X_train])
X_train_poly.shape
# (11, 4)

Now you can see that X_train_poly has 11 points, each 4-dimensional, rather than a single 364-dimensional point. This new X_train_poly matches the shape of y_train, and the regression may give you what you want:

linreg = LinearRegression().fit(X_train_poly, y_train)
linreg.coef_
# array([ 0.        , -0.79802899,  0.2120088 , -0.01285893])
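As a sanity check, the same fit can be written as a pipeline and compared against np.polyfit. This is only a sketch of an alternative, not something the question asked for. It assumes the default settings of PolynomialFeatures and LinearRegression, where the constant term ends up in intercept_ and the coefficient of the all-ones column stays at zero. The names model, sk_coefs, np_coefs, coefs_agree and test_r2 are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same data as in the question.
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# The pipeline expands the features and fits in one step; input must still be 2-D.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train.reshape(-1, 1), y_train)

# make_pipeline names each step after its lowercased class name.
linreg = model.named_steps["linearregression"]

# Replace the zero coefficient of the constant column with the fitted intercept.
sk_coefs = np.concatenate(([linreg.intercept_], linreg.coef_[1:]))

# np.polyfit returns the highest power first, so reverse it to compare.
np_coefs = np.polyfit(X_train, y_train, deg=3)[::-1]

coefs_agree = np.allclose(sk_coefs, np_coefs, atol=1e-6)
test_r2 = model.score(X_test.reshape(-1, 1), y_test)  # R^2 on the 4 held-out points
print(coefs_agree, test_r2)
```

If both routes agree, the pipeline version has the practical advantage that model.predict takes raw x values, so the polynomial expansion cannot be forgotten at prediction time.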
