From the data shown in example_data, it looks like you are working with a Pandas DataFrame, so I would suggest another possible approach to answering your question.
Here is some data I generated in the same format as yours, but with extra rows:
# Imports used throughout this answer
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Example data in the same format as example_data, with extra rows
d = [
    ['inches', 'city', 'Pizza_Price'],
    [5, 'A', 10],
    [6, 'B', 12],
    [7, 'C', 15],
    [8, 'D', 11],
    [9, 'B', 12],
    [10, 'C', 17],
    [11, 'D', 16]
]
df = pd.DataFrame(d[1:], columns=d[0])
print(df)
   inches  city  Pizza_Price
0       5     A           10
1       6     B           12
2       7     C           15
3       8     D           11
4       9     B           12
5      10     C           17
6      11     D           16
Per @Wen-Ben's suggestion, the city column can be converted into integers using LabelEncoder (as shown in this SO post):
# Encode the city labels as integers (A -> 0, B -> 1, C -> 2, D -> 3)
df['city'] = LabelEncoder().fit_transform(df['city'])
print(df)
   inches  city  Pizza_Price
0       5     0           10
1       6     1           12
2       7     2           15
3       8     3           11
4       9     1           12
5      10     2           17
6      11     3           16
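As an aside, the same mapping can also be produced with pandas alone by converting the string column to a categorical and taking its codes; this is just a sketch of an equivalent alternative (run it instead of, not after, the LabelEncoder line above):
# pandas-only alternative: categories are sorted, so 'A' -> 0, 'B' -> 1, 'C' -> 2, 'D' -> 3
df['city'] = df['city'].astype('category').cat.codes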
Step 1. Perform the train-test split to get the training and testing data: X_train, y_train, etc.
features = ['inches', 'city']
target = 'Pizza_Price'
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.33,
random_state=42)
# (OPTIONAL) Check number of rows in X and y of each split
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
4 4
3 3
Step 2. (Optional) Append a column to your source DataFrame (example_data) that shows which rows are used in training and testing:
df['Type'] = 'Test'
df.loc[X_train.index, 'Type'] = 'Train'
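As a quick sanity check (optional), you could confirm that the counts match the split sizes from Step 1:
# Should show 4 'Train' rows and 3 'Test' rows
print(df['Type'].value_counts())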
Step 3. Instantiate the LinearRegression model and train it using the training dataset (see this link from the sklearn docs):
model = LinearRegression()
model.fit(X_train, y_train)
Step 4. Now, make out-of-sample predictions on the testing data and (optionally) append the predicted values as a separate column to example_data: the rows used in the training dataset have no prediction and will be assigned NaN, while the rows used in the testing dataset will have a prediction.
df['Predicted_Pizza_Price'] = np.nan
df.loc[X_test.index, 'Predicted_Pizza_Price'] = model.predict(X_test)
print(df)
   inches  city  Pizza_Price   Type  Predicted_Pizza_Price
0       5     0           10   Test                   11.0
1       6     1           12   Test                   11.8
2       7     2           15  Train                    NaN
3       8     3           11  Train                    NaN
4       9     1           12  Train                    NaN
5      10     2           17   Test                   14.0
6      11     3           16  Train                    NaN
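If you only want to see the test rows, actual price next to predicted price, you could slice on X_test.index, for example:
# Test rows only: actual vs. predicted
print(df.loc[X_test.index, ['Pizza_Price', 'Predicted_Pizza_Price']])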
Step 5. Generate the model evaluation metrics (see point number 15 from here). We will build a Pandas DataFrame showing both (a) the model evaluation metrics and (b) the model properties, i.e. the linear regression coefficients and the intercept. To do this, we first collect all of these values in a Python dictionary and then convert the dictionary into a Pandas DataFrame.
Create a blank dictionary to hold the model properties (coefficients, intercept) and evaluation metrics:
dict_summary = {}
Append the coefficients and intercept to the dictionary:
for m, feature in enumerate(features):
    dict_summary['Coefficient ({})'.format(feature)] = model.coef_[m]
dict_summary['Intercept'] = model.intercept_
Append the evaluation metrics to the dictionary:
y_test = df.loc[X_test.index, 'Pizza_Price'].values
y_pred = df.loc[X_test.index, 'Predicted_Pizza_Price'].values
dict_summary['Mean Absolute Error (MAE)'] = metrics.mean_absolute_error(
y_test, y_pred)
dict_summary['Mean Squared Error (MSE)'] = metrics.mean_squared_error(
y_test, y_pred)
dict_summary['Root Mean Squared Error (RMSE)'] = np.sqrt(
metrics.mean_squared_error(y_test, y_pred)
)
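Optionally, further metrics can be added to the same dictionary in exactly the same way; for example, the R-squared score (not included in the output shown below):
# Optional extra metric: coefficient of determination (R^2)
dict_summary['R-squared (R2)'] = metrics.r2_score(y_test, y_pred)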
Convert the dictionary into a summary DataFrame showing the regression model properties and evaluation metrics:
df_metrics = pd.DataFrame.from_dict(dict_summary, orient='index', columns=['value'])
df_metrics.index.name = 'metric'
df_metrics.reset_index(drop=False, inplace=True)
Output of the model evaluation DataFrame:
print(df_metrics)
                           metric     value
0            Coefficient (inches)  0.466667
1              Coefficient (city)  0.333333
2                       Intercept  8.666667
3       Mean Absolute Error (MAE)  1.400000
4        Mean Squared Error (MSE)  3.346667
5  Root Mean Squared Error (RMSE)  1.829390
Using this approach, since your results are in two Pandas DataFrames, Pandas tools can be used to visualize the results of the regression analysis.
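For example, a minimal sketch of one such visualization (assuming matplotlib is installed) could plot actual versus predicted prices for the test rows:
import matplotlib.pyplot as plt

# Scatter plot of actual vs. predicted Pizza_Price for the test rows
test_rows = df.loc[X_test.index]
ax = test_rows.plot.scatter(x='Pizza_Price', y='Predicted_Pizza_Price')
ax.set_title('Actual vs. predicted pizza price (test rows)')
plt.show()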