How to calculate R-squared with Python?

Question

I have fitted a model from which I'd like to know the scores (r-squared). The data is split into a training and testing set. Although the model is only trained using the training set, how is it possible that my r-squared for my testing data is higher? I mean the model has never seen the testing set, but is more accurate than with the training set... Am I interpreting something wrong?

My code:

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import scipy
import sklearn
from sklearn.linear_model import LinearRegression
from scipy import stats
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

df=pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/module_5_auto.csv")
df=df._get_numeric_data()


y_data = df['price']
x_data=df.drop('price',axis=1)
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.15, random_state=1)
lr=LinearRegression()
lr.fit(x_train[['horsepower']], y_train)
h=lr.score(x_train[['horsepower']], y_train)   # R^2 on the training set; score() already returns a single number
h2=lr.score(x_test[['horsepower']], y_test)    # R^2 on the test set
print(h,h2)
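
Whether a test score above the training score is unusual can be checked with a small experiment. The following is a minimal sketch on synthetic data (the data, noise level and seeds are made up for illustration and are not taken from the CSV above): it refits a one-feature LinearRegression over many random 85/15 splits and counts how often the test R-squared exceeds the training R-squared. With a test set this small, the test score moves around a lot from split to split, so a single random_state can easily land on a split where it comes out higher.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: 200 rows, one feature,
# a noisy linear relationship between feature and target).
rng = np.random.default_rng(0)
x = rng.uniform(50, 250, size=(200, 1))
y = 150.0 * x[:, 0] + rng.normal(0.0, 4000.0, size=200)

def split_scores(seed):
    """Return (train R^2, test R^2) for one random 85/15 split."""
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.15, random_state=seed)
    model = LinearRegression().fit(x_train, y_train)
    return model.score(x_train, y_train), model.score(x_test, y_test)

# Repeat the split with 100 different seeds and compare the two scores.
scores = np.array([split_scores(seed) for seed in range(100)])
test_higher = int((scores[:, 1] > scores[:, 0]).sum())

print(f"test R^2 > train R^2 in {test_higher} of 100 splits")
print(f"test R^2 range: {scores[:, 1].min():.3f} to {scores[:, 1].max():.3f}")
```

Averaging over several splits, for example with cross_val_score, gives a steadier estimate than one split.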

We're going to need **far** more information than this to solve the issue. As an aside, why is there so little whitespace in your code? Also, sharing screenshots of code/data is discouraged. — AMC, Jan 08 '20 at 21:30
It does not make a difference if I show you the imported libraries — , Jan 09 '20 at 18:06
_It does not make a difference if I show you the imported libraries_ Alright? — AMC, Jan 09 '20 at 18:08
I'm confused as to what your first comment had to do with mine. — AMC, Jan 09 '20 at 18:58
You said that you needed more information and I only had the libraries left... — , Jan 09 '20 at 19:06
"the model has never seen the testing set, but is more accurate than with the training set... Am I interpreting something wrong" Yes: R2 is not a measure of accuracy. RMSE and similar error metrics are. — asachet, Jan 13 '20 at 16:13

Mihkuno · Answer 1 · 2022-04-05T16:11:37.277

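To put it simply, R-squared (the coefficient of determination) measures how much of the variance in the observed values is explained by the predicted values. It is often read loosely as how closely two series, such as actual and predicted values, agree.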

Formula
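
Reconstructed from the function under Method 1 (so the notation here is an assumption, not the answer's original rendering), the formula is the usual definition of the coefficient of determination, where y_i are the observed values, \hat{y}_i the predicted values and \bar{y} the mean of the observed values:

```latex
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
    = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```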

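Note: squaring Pearson's r (for example, squaring the result of pandas corr()) can give a different value than the R^2 formula above. The two coincide only for an ordinary least-squares fit with an intercept, evaluated on the data it was fitted to. For other predictions or on other data they differ, and R^2 can even be negative while r^2 cannot. Refer to Max Pierini's answer: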

SciKit Learn R-squared is very different from square of Pearson's Correlation R

Method 1: function

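def r_squared(y, y_hat):
    # y: observed values, y_hat: predicted values (NumPy arrays or pandas Series)
    y_bar = y.mean()
    ss_tot = ((y-y_bar)**2).sum()  # total sum of squares
    ss_res = ((y-y_hat)**2).sum()  # residual sum of squares
    return 1 - (ss_res/ss_tot)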
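
A quick usage sketch for the function above, with made-up numbers (the arrays are illustrative only). It also computes the squared Pearson correlation with np.corrcoef, to show that the two values need not agree when the predictions are not a least-squares fit to the same data: here the predictions are perfectly correlated with the observations but shifted by a constant.

```python
import numpy as np

def r_squared(y, y_hat):
    y_bar = y.mean()
    ss_tot = ((y - y_bar) ** 2).sum()  # total sum of squares
    ss_res = ((y - y_hat) ** 2).sum()  # residual sum of squares
    return 1 - (ss_res / ss_tot)

# Illustrative data: every prediction is too high by exactly 1.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([4.0, 6.0, 8.0, 10.0])

r2 = r_squared(y, y_hat)                        # penalises the constant offset
pearson_sq = np.corrcoef(y, y_hat)[0, 1] ** 2   # ignores the constant offset

print("R^2:", r2)
print("squared Pearson r:", pearson_sq)
```

R^2 comes out below 1 because of the offset, while the squared correlation stays at 1 (up to floating-point error).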

Method 2: sklearn library

from sklearn.metrics import r2_score

r2 = r2_score(actual, predicted)

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
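
A minimal usage sketch, with illustrative lists standing in for `actual` and `predicted` (the values are made up). Note the argument order: true values first, predictions second. The .score(X, y) method of a fitted LinearRegression reports this same quantity for the model's own predictions. The second call shows that the result can be negative when the predictions are worse than always predicting the mean.

```python
from sklearn.metrics import r2_score

# Illustrative values only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [4.0, 6.0, 8.0, 10.0]   # close to the truth, offset by 1

r2 = r2_score(actual, predicted)

# Predictions that run in the opposite direction to the truth.
reversed_predictions = [9.0, 7.0, 5.0, 3.0]
r2_bad = r2_score(actual, reversed_predictions)

print("reasonable predictions:", r2)
print("poor predictions:", r2_bad)
```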

score 1 · Accepted Answer · answered Jan 08 '20 at 21:14

It looks like you're using scikit-learn. If so, you can use the r2_score metric.

JP Alioto

Ok, but what is the difference between them both? I mean it should work.. – Jan 08 '20 at 21:23
But why doesn't my method work? If I use your method I get negative numbers which are very big. – Jan 08 '20 at 21:29
Because your fit is terrible? I have no idea. I don’t see your data. – JP Alioto Jan 09 '20 at 02:49
You see how I fitted them – Jan 09 '20 at 18:57
