Why is R2 negative even though a correlation exists?

Question

I am new to ML, and I am using the following code to compute RMSE and R2. However, the R2 value comes out as -43.13.

I have already gone through a few posts on Stack Overflow explaining the significance of a negative R2. But in my data set it is clear that as the 'certifications' value increases, so does the 'salary', so there is clearly a positive correlation between them. Then why is R2 negative?

Certifications data: [ 2.  3.  5.  6.  7.  9. 10. 14.]

Salary data: [22000. 23000. 24000. 28000. 33000. 42000. 44000. 53000.]

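from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# certifications and salary are the NumPy arrays printed above
model=LinearRegression()
certification_train,certification_test,salary_train,salary_test=train_test_split(certifications,salary,test_size=0.2)
model.fit(certification_train.reshape(-1,1), salary_train.reshape(-1,1))
salary_prediction=model.predict(certification_test.reshape(-1,1))
print("R2:",r2_score(salary_test,salary_prediction))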

Can you show what `model` and `r2_score` are? And do you train your model on `certification_train` and `salary_train`? — ignoring_gravity, Nov 01 '19 at 14:12
Ok. And you're doing `model.fit(certification_train, salary_train)` right after the `train_test_split` line? — ignoring_gravity, Nov 01 '19 at 14:27
It's very hard to read your code example. It seems `model.fit(certification_train.reshape(-1,1), salary_train.reshape(-1,1))` is run before `certification_train` is defined — KPLauritzen, Nov 01 '19 at 14:31
Sorry for the wrong edit @KPLauritzen. I have re-formatted it now. Not sure why Stackoverflow was adding extra spaces. — quietboy, Nov 01 '19 at 14:35
Generally speaking, you may want to note that R2 is practically **never** used in predictive ML settings, and that R2 for a *test* set is not a well-defined notion; see the last part of [this answer](https://stackoverflow.com/questions/54614157/scikit-learn-statsmodels-which-r-squared-is-correct/54618898#54618898). — desertnaut, Nov 01 '19 at 15:03

score 1 · Answer 1 · answered Nov 01 '19 at 14:45

This is due to you having a really small sample size.

When I try running your code, I get

R2: 0.9030842872008327

With such a small sample size (2 samples in your test set, 6 in your training set), you can't expect a model to do well, and its score is determined mostly by which samples train_test_split sends to the training set and which to the test set.
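To see why two test samples make the score so fragile, it may help to write out the usual definition of R2 as computed on the test set. The numbers in the comment are only an illustration, assuming the split happens to hold out the two lowest salaries from the question (22000 and 23000):

```latex
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
    = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y}_{\text{test}})^2}
% Illustration (assumed split): held-out salaries 22000 and 23000 give
% \bar{y}_{\text{test}} = 22500 and SS_tot = 500^2 + 500^2 = 500000,
% so R^2 < 0 as soon as SS_res > 500000.
```

When the two held-out salaries are close together, the denominator is tiny, so even modest prediction errors make the ratio exceed 1 and push R2 below zero, regardless of how strongly the two variables are correlated overall.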

Try changing your train_test_split line to

certification_train,certification_test,salary_train,salary_test=train_test_split(np.array(certifications),np.array(salary),test_size=0.2, random_state=1)

and see how much your R2 changes according to which random state you pick!
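Rather than sampling a few random states, one can list every possible 2-sample test set for the question's data. The sketch below is dependency-free and makes two substitutions worth flagging: it uses the closed-form least-squares line for a single feature in place of `LinearRegression`, and it enumerates all 28 splits instead of varying `random_state`. The helper names (`fit_line`, `r2`, `r2_for_test_pair`) are hypothetical and not part of scikit-learn.

```python
from itertools import combinations

# Data copied from the question.
certifications = [2, 3, 5, 6, 7, 9, 10, 14]
salary = [22000, 23000, 24000, 28000, 33000, 42000, 44000, 53000]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def r2_for_test_pair(test_idx):
    """Hold out the two rows in test_idx, fit on the other six, score on the pair."""
    train_idx = [i for i in range(len(salary)) if i not in test_idx]
    slope, intercept = fit_line(
        [certifications[i] for i in train_idx],
        [salary[i] for i in train_idx],
    )
    y_true = [salary[i] for i in test_idx]
    y_pred = [slope * certifications[i] + intercept for i in test_idx]
    return r2(y_true, y_pred)


# One score per possible 2-sample test set (8 choose 2 = 28 splits).
scores = {pair: r2_for_test_pair(pair) for pair in combinations(range(8), 2)}

# Print the splits from worst to best score.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1]):
    held_out = [certifications[i] for i in pair]
    print(f"test certifications={held_out}: R2={score:.2f}")
```

Holding out the two extreme points (2 and 14 certifications) gives an R2 of about 0.90, which matches the output shown above, while holding out two neighbouring points such as 2 and 3 gives a strongly negative score. The data and the model family are the same in both cases; only the split differs.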

ignoring_gravity


Thanks for replying. random_state=0 gives a much better value than 1. I will check it further. – quietboy Nov 01 '19 at 14:50
Can you please post your _complete_ code then? Currently, if copy-and-pasted, it doesn't run – ignoring_gravity Nov 01 '19 at 14:50
