I am new to ML, and I am using the following code to figure out RMSE & R2. However, the R2 value is shown as: -43.13.

I have already gone through a few posts on Stack Overflow that explain the significance of a negative R2. But in my data set it is clear that as 'certifications' increases, so does 'salary', so there is clearly a positive correlation between them. Then why is R2 negative?

Certifications data: [ 2.  3.  5.  6.  7.  9. 10. 14.]

Salary data: [22000. 23000. 24000. 28000. 33000. 42000. 44000. 53000.]

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

model=LinearRegression()

certification_train,certification_test,salary_train,salary_test=train_test_split(certifications,salary,test_size=0.2)

model.fit(certification_train.reshape(-1,1), salary_train.reshape(-1,1))

salary_prediction=model.predict(certification_test.reshape(-1,1))

print("R2:",r2_score(salary_test,salary_prediction))
quietboy
  • Can you show what `model` and `r2_score` are? And do you train your model on `certification_train` and `salary_train`? – ignoring_gravity Nov 01 '19 at 14:12
  • Sorry, forgot to mention: model=LinearRegression() – quietboy Nov 01 '19 at 14:20
  • Ok. And you're doing `model.fit(certification_train, salary_train)` right after the `train_test_split` line? – ignoring_gravity Nov 01 '19 at 14:27
  • It's very hard to read your code example. It seems `model.fit(certification_train.reshape(-1,1), salary_train.reshape(-1,1))` is run before `certification_train` is defined – KPLauritzen Nov 01 '19 at 14:31
  • Sorry for the wrong edit @KPLauritzen. I have re-formatted it now. Not sure why Stack Overflow was adding extra spaces. – quietboy Nov 01 '19 at 14:35
  • Generally speaking, you may want to note that R2 is practically **never** used in predictive ML settings, and that R2 for a *test* set is not a well-defined notion; see the last part of [this answer](https://stackoverflow.com/questions/54614157/scikit-learn-statsmodels-which-r-squared-is-correct/54618898#54618898). – desertnaut Nov 01 '19 at 15:03

1 Answer

This is because your sample size is really small.

When I try running your code, I get

R2: 0.9030842872008327

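With such a small sample size (8 samples in total, so test_size=0.2 leaves 2 in your test set and 6 in your train set), you can't expect a model to do well, and how well it performs is predominantly determined by which samples train_test_split sends to train and which to test. R2 compares the model's squared error on the test set with that of always predicting the test-set mean, so when the two test salaries happen to be close together, even a modest prediction error yields a large negative R2, regardless of the positive correlation in the full data.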
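To see how a two-point test set can give a strongly negative R2 on positively correlated data, here is a minimal sketch that computes R2 from its definition. The split is fixed by hand, with the two lowest points as the test set; that choice is an assumption made purely for illustration and is not necessarily the split you got. The names `test_idx`, `train_idx`, `ss_res`, `ss_tot` and `r2_manual` are mine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

certifications = np.array([2., 3., 5., 6., 7., 9., 10., 14.])
salary = np.array([22000., 23000., 24000., 28000., 33000., 42000., 44000., 53000.])

# Hypothetical split chosen for illustration: the two lowest points are the test set.
test_idx = np.array([0, 1])
train_idx = np.setdiff1d(np.arange(len(salary)), test_idx)

# Fit on the remaining six points.
model = LinearRegression()
model.fit(certifications[train_idx].reshape(-1, 1), salary[train_idx])
pred = model.predict(certifications[test_idx].reshape(-1, 1))

y_test = salary[test_idx]
ss_res = np.sum((y_test - pred) ** 2)           # squared error of the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # squared error of predicting the test mean
r2_manual = 1 - ss_res / ss_tot

print("slope:", model.coef_[0])
print("R2 (manual):", r2_manual)
print("R2 (sklearn):", r2_score(y_test, pred))
```

The fitted slope is positive, yet R2 comes out well below zero: the two test salaries are only 1000 apart, so `ss_tot` is tiny compared with the model's squared error on those points.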

Try changing your train_test_split line to

certification_train,certification_test,salary_train,salary_test=train_test_split(np.array(certifications),np.array(salary),test_size=0.2, random_state=1)

and see how much your R2 changes according to which random state you pick!
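A rough sketch of that experiment, assuming the eight data points from the question: it refits the model for several seeds and records the test-set R2 for each. The seed range 0 to 9 and the `scores` dictionary are arbitrary choices of mine, not something from your code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

certifications = np.array([2., 3., 5., 6., 7., 9., 10., 14.])
salary = np.array([22000., 23000., 24000., 28000., 33000., 42000., 44000., 53000.])

scores = {}
for seed in range(10):  # arbitrary set of seeds, for illustration only
    X_train, X_test, y_train, y_test = train_test_split(
        certifications.reshape(-1, 1), salary, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores[seed] = r2_score(y_test, model.predict(X_test))

# Print one R2 per seed to see how much the score depends on the split.
for seed, score in scores.items():
    print(f"random_state={seed}: R2={score:.2f}")
```

With only two test points per split, the printed values should swing widely from seed to seed, which is the instability described above.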

ignoring_gravity