Is my pipeline not imputing values correctly?

Question

I'm new to Python and have been learning about pipelines from DataCamp. I have been experimenting with some FIFA data that has missing (NaN) values. I have tried to create a pipeline that first imputes any missing data (replacing it with the mean) and then fits a logistic regression. I don't get any errors in the output. However, when I print things such as print(x_train) and print(y_pred), the output still contains NaN values. Does that indicate that my pipeline is not working and that the data was not correctly imputed? Surely I should be seeing the mean values rather than NaN. I would appreciate it if someone could answer in layman's terms, as I am new to the topic.

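import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

fif_data = pd.read_csv("fifa_draft_1.csv")

# One-hot encode the categorical columns
df_Foot_Dummy = pd.get_dummies(fif_data, drop_first=True)

# Pipeline steps: mean imputation, then logistic regression
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
logreg = LogisticRegression()

# Single feature, reshaped to the 2-D array scikit-learn expects
x = df_Foot_Dummy["passing"].values.reshape(-1, 1)
y = df_Foot_Dummy["preferred_foot_Right"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

steps = [("imputation", imp), ("logistic_regression", logreg)]
pipe = Pipeline(steps)
pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)
print(x_train)
print(y_pred)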

Hi, please read this: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples and update your code so we all can run the same thing. For instance you can replace the first line so we all have a sample of data into `fif_data` — Be Chiller Too, May 31 '21 at 12:58
Please print the shapes of x, x_train, x_test, y, y_train and y_test — Be Chiller Too, May 31 '21 at 13:01

score 1 · Answer 1 · answered May 31 '21 at 21:13

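Pipelines do not change data in-place; at each step, the data is transformed and passed along to the next step, but the intermediate results are not saved (with a partial exception when the `memory` parameter is set, which caches the fitted transformers). That is why print(x_train) still shows your original array, NaNs included.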
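As a minimal sketch of this, assuming made-up numbers in place of the FIFA data (the array and labels below are hypothetical stand-ins, not your dataset), you can compare the array you passed in with what the fitted imputation step actually produces:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the "passing" column (not the real FIFA data)
rng = np.random.default_rng(42)
x_train = rng.normal(60, 10, size=(100, 1))
y_train = (x_train[:, 0] > 60).astype(int)
x_train[::10] = np.nan  # blank out every 10th value (10 rows in total)

pipe = Pipeline([
    ("imputation", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("logistic_regression", LogisticRegression()),
])
pipe.fit(x_train, y_train)

# The array you passed in is untouched, so its NaNs are still there
nan_in_input = int(np.isnan(x_train).sum())

# Re-run only the fitted imputation step to see what the classifier received
x_train_imputed = pipe.named_steps["imputation"].transform(x_train)
nan_after_imputation = int(np.isnan(x_train_imputed).sum())

# Predictions are class labels, so they should contain no NaN
y_pred = pipe.predict(x_train)
nan_in_predictions = int(np.isnan(y_pred.astype(float)).sum())

print(nan_in_input, nan_after_imputation, nan_in_predictions)
```

With these made-up values, the first count is 10 and the other two are 0: the input keeps its NaNs, while the data reaching the logistic regression and the predictions coming out of it have none.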

scikit-learn's LogisticRegression raises an error when its input contains NaN, so the fact that it doesn't complain indicates that the imputation has in fact happened.
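If you want to see what "complaining" looks like, here is a small sketch that fits the classifier directly on data with a missing value, skipping the imputer. The four values are invented for illustration, and the exact error message depends on your scikit-learn version:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset with one missing value
x = np.array([[50.0], [np.nan], [70.0], [65.0]])
y = np.array([0, 1, 1, 0])

try:
    # No imputation step here, so the NaN reaches the classifier directly
    LogisticRegression().fit(x, y)
    raised = False
except ValueError as err:
    raised = True
    print("fit failed:", err)
```

Since your pipeline fits without this error, the NaNs must have been filled in before the data reached the logistic regression.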

y_pred shouldn't have any missing values; if it does, please let us know and provide an example dataset.
