Python: Combine predicted y-variable labels to the dataframe

Question

I have a multi-class label prediction problem: identifying, let's say, fruits. I have trained and tested the model and can get predictions from its fit and predict functions. The code is below. I am trying to merge my predictions, stored in the variable "forest_y_pred", back into my original data set so that I can compare the original target variable with the predicted target variable in one data frame. I have 2 questions:

1) Is y_test the same as forest_y_pred = forest.predict(X_test)? I get exactly the same results when I compare them. Am I getting this wrong? I am a bit confused here: predict() is supposed to predict labels for X_test, not reproduce y_test exactly.

2) I am trying to merge forest_y_pred = forest.predict(X_test) back into df. Here is what I tried, based on this question: Merging results from model.predict() with original pandas DataFrame?

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load Data
df = pd.read_excel('../data/file.xlsx',converters={'col1':str})
df = df.set_index('INDEX_ID') # Setting index id
df

# Doing it this way because of the index: INDEX_ID is a column in the df
# (.loc replaces the deprecated .ix indexer)
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Target'])], df.Target, train_size=0.5)

print(y_test[:5])
type(y_test) #pandas.core.series.Series

ID
12      Apples
124     Oranges
345     Apples
123     Oranges
232     Kiwi

forest = RandomForestClassifier()

# Training
forest_model = forest.fit(X_train, y_train)
print(forest_model)

# Predictions
forest_y_pred = forest.predict(X_test) 
print("forest_y_pred:\n",forest_y_pred[:5])
['Apples','Oranges','Apples','Oranges','Kiwi']

y_test['preds'] = forest_y_pred
print(y_test['preds'][:5])
['Apples','Oranges','Apples','Oranges','Kiwi']

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
# ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
# How do I fix this? I tried a ton of ways to convert between ndarray, Series, and DataFrame...nothing I have tried has worked so far. Thanks a bunch!!
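
On the merge error: y_test is a Series, and a Series has no columns, so `y_test['preds'] = forest_y_pred` does not create one, and `y_test[['preds']]` is still a Series when it reaches `pd.merge`. Below is a minimal sketch of one possible way around this, not necessarily the only one: wrap the predictions in their own Series indexed like X_test, then join on the index. The small DataFrame and its `weight` and `sweetness` columns are made up for illustration and stand in for the Excel file; the `random_state` and `n_estimators` values are likewise assumptions added for reproducibility.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Excel data: two numeric features and a Target.
df = pd.DataFrame(
    {
        "weight": [150, 160, 120, 130, 70, 75, 155, 125, 72, 165],
        "sweetness": [6, 7, 8, 9, 5, 4, 6, 8, 5, 7],
        "Target": ["Apples", "Apples", "Oranges", "Oranges", "Kiwi",
                   "Kiwi", "Apples", "Oranges", "Kiwi", "Apples"],
    },
    index=pd.Index([12, 124, 345, 123, 232, 7, 88, 91, 301, 402],
                   name="INDEX_ID"),
)

X = df.drop(columns=["Target"])
y = df["Target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0
)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

# predict() returns a plain ndarray without index labels, so wrap it in a
# named Series that carries the index of the rows it was computed for.
preds = pd.Series(forest.predict(X_test), index=X_test.index, name="preds")

# Left join on the index: test rows get a prediction, training rows get NaN.
df_out = df.join(preds, how="left")
print(df_out)
```

Rows that went into training show NaN in `preds`, which makes it easy to filter to the test rows with `df_out.dropna(subset=["preds"])` before comparing `Target` against `preds`.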

If your predictions from your model are the exact same as the actual values, it's likely that you have some sort of [data leakage](https://machinelearningmastery.com/data-leakage-machine-learning/) in your model training. Have you tried a `predict_proba()` to see if your model is predicting with 100% probabilities, or done a `score` or `confusion_matrix` to check your predictions? — G. Anderson, Oct 04 '18 at 15:31
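
The checks suggested in this comment could look roughly like the sketch below. It uses synthetic data from `make_classification` rather than the original fruit data, so the dataset, the `random_state` values, and `n_estimators=50` are all assumptions for illustration. On real data, a perfect score together with probabilities that are all 0 or 1 would be a hint (not proof) that some feature encodes the target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the real data.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# Mean accuracy on the held-out rows.
accuracy = forest.score(X_test, y_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=forest.classes_)

# One row per test sample, one column per class; each row sums to 1.
proba = forest.predict_proba(X_test)

print("accuracy:", accuracy)
print("confusion matrix:\n", cm)
print("first five probability rows:\n", proba[:5])
```

If the confusion matrix has entries only on its diagonal, it is worth inspecting each feature column for a copy or a near-copy of the target.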
