I would like to build a dataframe that compares the predictions of a regression model (y_hat) with the test data (y_test). I have two ways of selecting the data. Access method 1 works, but access method 2 fails when I try to build the comparison dataframe.
Access method 1:
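import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# X_data is a one-column dataframe, y_data is a series
X_data = df_scores[['Hours']]
y_data = df_scores['Scores']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, random_state=0)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_hat = lm.predict(X_test)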
This dataframe works:
df_scores_comp = pd.DataFrame({'Actual':y_test, 'Predicted':y_hat})
df_scores_comp
Access method 2:
But if I want to use the following way to access X_data and y_data ...
X_data = df_scores.loc[:, ['Hours']]
y_data = df_scores.loc[:, ['Scores']]
and then build df_scores_comp the same way, I get the following error:
If using all scalar values, you must pass an index
With either access method, y_hat is a NumPy array and X_data is a dataframe. However, y_data is a series with the first access method and a dataframe with the second. I suspected the cause lies there, perhaps in what lm.predict returns, but I can't figure it out.
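To make the difference concrete, here is a minimal sketch that compares the types and shapes produced by the two access methods. The small df_scores below is a made-up stand-in for my real data, and the names y_series, y_frame, y_hat_1d and y_hat_2d are only for illustration. It skips the train/test split, since the split does not change the number of dimensions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for df_scores; the values are made up for illustration.
df_scores = pd.DataFrame({
    'Hours': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Scores': [12, 21, 33, 39, 52, 61],
})

X_data = df_scores[['Hours']]

# Access method 1: a single label selects a Series (1-D).
y_series = df_scores['Scores']
# Access method 2: a list of labels selects a one-column DataFrame (2-D).
y_frame = df_scores.loc[:, ['Scores']]

# A 1-D target gives 1-D predictions.
y_hat_1d = LinearRegression().fit(X_data, y_series).predict(X_data)
# A 2-D target gives 2-D predictions with one column.
y_hat_2d = LinearRegression().fit(X_data, y_frame).predict(X_data)

print(type(y_series).__name__, y_series.shape)  # Series (6,)
print(type(y_frame).__name__, y_frame.shape)    # DataFrame (6, 1)
print(y_hat_1d.shape)                           # (6,)
print(y_hat_2d.shape)                           # (6, 1)
```

So y_hat is an array in both cases, but with the second access method it is two-dimensional, just like y_test.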
When I tried the solution offered here (Constructing pandas dataframes...) by wrapping the dictionary in a list, I don't get an error. But the result isn't right: the y_hat (predicted) elements are in the correct column, but are squeezed into one row. And the y_test (Actual) elements and the index values are mixed up in the wrong columns and are squeezed into one row as well. Something like this:
Actual Predicted
0 Scores 5 20 2 27 19 69 16... [[16.884144762398048], [33.73226077948985], [7...
It should look like this (which it does using the first access method):
    Actual  Predicted
5       20  16.884145
2       27  33.732261
19      69  75.357018
16      30  26.794801
11      62  60.491033
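For reference, one possible way to get this layout with the second access method is to reduce both objects to one dimension before building the dataframe. This is only a sketch under assumptions: y_test and y_hat below are hand-built stand-ins with the same shapes as the objects from access method 2, using the first three rows of the expected output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins shaped like the objects from access method 2.
y_test = pd.DataFrame({'Scores': [20, 27, 69]}, index=[5, 2, 19])  # (3, 1) DataFrame
y_hat = np.array([[16.884145], [33.732261], [75.357018]])          # (3, 1) array

df_scores_comp = pd.DataFrame({
    'Actual': y_test['Scores'],   # select the column to get a 1-D Series
    'Predicted': y_hat.ravel(),   # flatten (n, 1) to (n,)
})
print(df_scores_comp)
```

The index of the result comes from the y_test series, so the original row labels are preserved. Selecting with a single label, df_scores.loc[:, 'Scores'], instead of a list should avoid the problem at the source, because y_data is then a series again.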