In the linear model $Y = \beta_0 + \beta_1 X_i + \beta_2 X_j + \beta_3 X_k + \varepsilon$, which values of $i, j, k \in [1, 100]$ give the model with the highest R-squared?
The data set consists of 100 independent variables and one dependent variable. Each variable has 50 observations.
My only guess is to loop through all possible combinations of three variables and compare R-squared for each combination. The way I have done it with Python is:
    import itertools as itr
    import pandas as pd
    import time as t
    from sklearn import linear_model as lm

    start = t.time()

    # linear regression model
    LR = lm.LinearRegression()

    # import data (column 0 is the dependent variable, columns 1-100 the predictors)
    data = pd.read_csv('csv_file')

    # all possible combinations of three predictor columns
    combs = list(itr.combinations(range(1, 101), 3))

    target = data.iloc[:, 0]
    hi_R2 = 0
    indices = None

    for comb in combs:
        # iloc expects a list of column positions, not a tuple
        variables = data.iloc[:, list(comb)]
        R2 = LR.fit(variables, target).score(variables, target)
        if R2 > hi_R2:
            hi_R2 = R2
            indices = comb

    end = t.time()
    elapsed = (end - start) / 60  # elapsed time in minutes

    print('Variables: {}\nR2 = {:.2f}\nTime: {:.1f} mins'.format(indices, hi_R2, elapsed))
It took 4.3 minutes to complete. I believe this method is not efficient for data sets with thousands of observations per variable. What method would you suggest instead?
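For reference, here is a minimal sketch of the same exhaustive search written with plain NumPy. It is only an illustration, not a claim that this is the best method. It centers the data once and precomputes the cross-product matrices, so each of the C(100, 3) = 161,700 fits reduces to solving a 3x3 system regardless of how many observations there are. The function name `best_triple` is hypothetical, column indices are 0-based positions within the predictor matrix `X` (the target is passed separately), and the sketch assumes no triple of predictors is exactly collinear.

```python
import itertools

import numpy as np


def best_triple(X, y):
    """Exhaustive search over all 3-column subsets of X for the highest R^2.

    X is an (n_obs, n_vars) array of predictors, y an (n_obs,) target.
    Hypothetical helper; assumes no subset of three columns is collinear.
    """
    # Center once so the intercept drops out of every sub-model.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    G = Xc.T @ Xc   # predictor cross-products, computed once
    c = Xc.T @ yc   # predictor-target cross-products, computed once
    tss = yc @ yc   # total sum of squares

    best_r2, best_comb = -np.inf, None
    for comb in itertools.combinations(range(X.shape[1]), 3):
        idx = list(comb)
        # Solve the 3x3 normal equations for this subset.
        beta = np.linalg.solve(G[np.ix_(idx, idx)], c[idx])
        # For least squares with an intercept, R^2 = explained SS / total SS.
        r2 = (c[idx] @ beta) / tss
        if r2 > best_r2:
            best_r2, best_comb = r2, comb
    return best_comb, best_r2
```

With the data frame above, this could be called as `best_triple(data.iloc[:, 1:].to_numpy(), data.iloc[:, 0].to_numpy())`; adding 1 to each returned index gives the column positions used in the original loop.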
Thank you.