
I am experimenting with fitting polynomial transformations of degree 1 to 3 to the original data, generating 100 predicted values for each. I first 1) reshaped the original data, 2) applied fit_transform to the training set and to the prediction space (of data features), 3) obtained linear predictions over the prediction space, and 4) collected them into an array, using the following code:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)    
pred_data = np.linspace(0,10,100).reshape(-1,1)
results = []

for i in [1, 2, 3] :
    poly = PolynomialFeatures(degree = i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    pred_data = poly.fit_transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data)
    results.append(pred)
results
    

However, I did not get what I wanted: Python did not return an array of shape (3, 100) as I was expecting. Instead, I received an error message:

ValueError: shapes (100,10) and (4,1) not aligned: 10 (dim 1) != 4 (dim 0)

It seems to be a dimensional problem resulting either from the reshape step or from the fit_transform step. I am confused, as this was supposed to be a straightforward test. Would anyone enlighten me on this? It would be much appreciated.

Thank you.

Sincerely,

Chris T.
  • You should always call just `transform()` on test data, the `pred_data` in your case. A `fit_transform()` call forgets the previous fit and learns the data again, which can result in different dimensions. – Vivek Kumar Jun 11 '17 at 14:53
  • I tried using poly.transform() on pred_data, but Python still returned an error message: X shape does not match training shape. – Chris T. Jun 11 '17 at 15:18

1 Answer


First, as I suggested in the comments, you should always call just transform() on test data (pred_data in your case).

But even if you do that, a different error occurs. The error is due to this line:

pred_data = poly.fit_transform(pred_data)

Here you are replacing the original pred_data with its transformed version. For the first iteration of the loop this works, but for the second and third iterations it becomes invalid, because the transform expects the original pred_data of shape (100, 1) defined in this line above the for loop:

pred_data = np.linspace(0,10,100).reshape(-1,1)
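
To see why, here is a minimal sketch of how the feature count drifts when fit_transform is repeatedly applied to its own output (the variable name a and the degree sequence are just for illustration; the exact column counts depend on which transforms have already run in your session, but the mechanism is the same):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

a = np.linspace(0, 10, 100).reshape(-1, 1)         # (100, 1), like the original pred_data
a = PolynomialFeatures(degree=1).fit_transform(a)  # (100, 2): columns [1, x]
a = PolynomialFeatures(degree=3).fit_transform(a)  # (100, 10): all degree<=3 products of 2 columns
print(a.shape)                                     # (100, 10)

A degree-3 model trained on a single feature expects only 4 columns ([1, x, x^2, x^3]), which matches the "shapes (100,10) and (4,1) not aligned" message in the error.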

Change the name of the variable inside the loop to something else and all works well:

for i in [1, 2, 3]:
    poly = PolynomialFeatures(degree=i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)

    # Changed here
    pred_data_poly1 = poly.transform(pred_data)

    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data_poly1)
    results.append(pred)

results
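
One more note on the (3, 100) shape mentioned in the question: because y was reshaped to a column vector, each predict call returns an array of shape (100, 1), so results stacks to (3, 100, 1) rather than (3, 100). If you want the flat (3, 100) array, one option (continuing with the variables above; results_arr is just an illustrative name) is:

results_arr = np.array(results)            # shape (3, 100, 1): 3 degrees x 100 points x 1 target column
results_arr = results_arr.reshape(3, 100)  # drop the trailing axis -> shape (3, 100)
print(results_arr.shape)                   # (3, 100)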
Vivek Kumar