
I have a dataset in which I transformed the categorical variables into numerical ones with dummies, and I ran a simple linear regression model to predict the dependent variable. I got an adjusted R-squared of 0.66. Now I want to cross-validate my model with the leave-one-out method and check whether the LOOCV adjusted R-squared is similar to that of my pre-cross-validation model.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

cv = LeaveOneOut()
data = pd.read_excel(r'C:/Users/LENOVO/Documents/Diwali_Impact_coding/Modelling/Model_Data.xlsx', usecols=['PMlog', 'Temp', 'RH', 'WSlog', 'Type', 'Popu', 'FRPlog', 'Region'], sheet_name='City_cook2')
data.dropna(subset=['PMlog', 'Temp', 'RH', 'WSlog'], inplace=True)

data_log1 = pd.get_dummies(data, columns=['Type', 'Region', 'Popu'])  # all numerical features
X = data_log1.loc[:, data_log1.columns != 'PMlog']  # independent/predictor variables
y = data_log1.loc[:, 'PMlog']  # dependent variable

model_LR = LinearRegression()
model_LR.fit(X, y)

def adj_Rsqr(model_LR, X, y):
    # adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    xx = 1 - (1 - model_LR.score(X, y)) * (len(y) - 1) / (len(y) - X.shape[1] - 1)
    return xx

adj_Rsqr(model_LR, X, y)  # 0.66
scores = cross_val_score(model_LR, X, y, scoring=adj_Rsqr, cv=cv, n_jobs=-1)
np.mean(scores)

My scores are all coming out as NaN. Can anybody help me understand why? If I use 'r2' as the scoring it also comes out as NaN, but not with other scorings such as absolute error, etc.

Thank you for any help.

1 Answer


Cross-validation is the process of splitting your data into a training split and a test split in order to validate the model on data it was not trained on. When you apply LeaveOneOut cross-validation, the test split is just one sample and the train split is all the other samples. An R-squared does not make much sense for a single sample in the test split.
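To make that concrete, here is a minimal sketch (using a made-up 5-sample array, not your data) showing that every LeaveOneOut test split contains exactly one index:

import numpy as np
from sklearn.model_selection import LeaveOneOut

X_demo = np.arange(10).reshape(5, 2)  # 5 samples, 2 features
for train_idx, test_idx in LeaveOneOut().split(X_demo):
    print("train:", train_idx, "test:", test_idx)  # test split is always a single index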

When I coded LOOCV for sklearn's diabetes dataset to reproduce the behavior you got, I got the following warning:

UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
  warnings.warn(msg, UndefinedMetricWarning)
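The reproduction I used was roughly the following sketch (my own toy example, not your data); running plain 'r2' scoring with LOOCV triggers that warning and yields NaN for every fold:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X_d, y_d = load_diabetes(return_X_y=True)
loo_scores = cross_val_score(LinearRegression(), X_d, y_d, cv=LeaveOneOut(), scoring='r2')
print(np.isnan(loo_scores).all())  # True: R^2 is undefined on a one-sample test split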

Consequently, a possible solution could be using KFold cross-validation and choosing a high k with k ≤ n/2, so that you have at least two samples in each test split.

Your scoring function can also be improved: you could use sklearn.metrics.make_scorer, which wraps a scoring function. According to the documentation, that scoring function must have the signature score_func(y, y_pred, **kwargs).

So, a scoring function and a scorer could look like this in your case:

from sklearn.metrics import make_scorer, r2_score

def adjusted_r2_score(y_true, y_pred, n_features):  # takes the actual and predicted values instead of the estimator
    r2 = r2_score(y_true, y_pred)
    n_samples = len(y_true)
    adj_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
    return adj_r2

adj_r2_scorer = make_scorer(adjusted_r2_score, greater_is_better=True, n_features=X.shape[1])

But then, your code has to change a little bit:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

model_LR = LinearRegression()
model_LR.fit(X, y)

y_pred = model_LR.predict(X)  # this step is new: the scorer needs predictions, not the estimator
score_no_cv = adjusted_r2_score(y, y_pred, X.shape[1])
print(f"Adjusted R-squared without CV: {score_no_cv:.4f}")

# Create a KFold cross-validation object with k as half of the number of samples
n_samples = len(y)
k = n_samples // 2
kf = KFold(n_splits=k)

# Calculate the adjusted R-squared score with KFold cross-validation
scores_kf = cross_val_score(model_LR, X, y, cv=kf, scoring=adj_r2_scorer)
avg_score_kf = np.mean(scores_kf)
print(f"Average Adjusted R-squared with {k}-fold CV: {avg_score_kf:.4f}")

I liked the question; I had fun thinking about it.
