In the linear model $y = \beta_0 + \beta_1 x_i + \beta_2 x_j + \beta_3 x_k + \varepsilon$, what values of $i, j, k \in [1, 100]$ result in the model with the highest R-squared?

The data set consists of 100 independent variables and one dependent variable. Each variable has 50 observations.

My only guess is to loop through all possible combinations of three variables and compare R-squared for each combination. The way I have done it with Python is:

import itertools as itr
import time as t

import pandas as pd
from sklearn import linear_model as lm

start = t.time()

# linear regression model
LR = lm.LinearRegression()

# import data: column 0 is the target, columns 1-100 are the predictors
data = pd.read_csv('csv_file')

target = data.iloc[:, 0]
hi_R2 = float('-inf')
indices = None

# loop over all possible combinations of three predictor columns
for comb in itr.combinations(range(1, 101), 3):
    variables = data.iloc[:, list(comb)]
    R2 = LR.fit(variables, target).score(variables, target)
    if R2 > hi_R2:
        hi_R2 = R2
        indices = comb

end = t.time()
minutes = (end - start) / 60

print('Variables: {}\nR2 = {:.2f}\nTime: {:.1f} mins'.format(indices, hi_R2, minutes))

It took 4.3 minutes to complete. I believe this method will not be efficient for a data set with thousands of observations per variable. What method would you suggest instead?

Thank you.

antdro
  • Do you mean lowest MSE? Also, this question belongs on Code Review, since your code does run and you are trying to make it more efficient. Please post it there (http://codereview.stackexchange.com/questions/tagged/python) – Ma0 Jun 29 '16 at 11:18
  • It may also be a question for http://stats.stackexchange.com, because it is a common problem unrelated to Python. Look for "predictor selection" or this wikipedia article: https://en.wikipedia.org/wiki/Stepwise_regression as one example "solution". – StefanS Jun 29 '16 at 11:44
  • Ev. Kounis, I am looking for the three variables that best explain the variation in the target. I would appreciate your comments/links on why MSE is better than R-squared for this purpose. Thank you for the suggestion to post this question on Code Review. Shall I delete this question here? StefanS, thank you for the link to stepwise regression. – antdro Jun 29 '16 at 13:34
  • I think you need to find the statistical method you want to use first (unless you want to stay with brute force) and once you know that, a Python implementation may (or may not) be a simple web search away. The first part is most likely the harder problem to solve. – StefanS Jun 29 '16 at 13:40

1 Answer

Exhaustive search is going to be the slowest way of doing this.
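
To put a number on that, here is a minimal sketch (not part of the original post) that counts how many regressions an exhaustive search over three-variable subsets has to fit. It uses only `math.comb` from the standard library (Python 3.8+). The 1000-predictor case is a hypothetical added for comparison.

```python
import math

# Number of distinct three-variable models among 100 candidate predictors.
n_predictors = 100
exhaustive_fits = math.comb(n_predictors, 3)
print(exhaustive_fits)  # 161700

# Hypothetical larger problem: the count grows roughly with the cube of
# the number of predictors.
larger_fits = math.comb(1000, 3)
print(larger_fits)  # 166167000
```

The number of fits depends on how many predictors there are and on the subset size, not on the number of observations. More observations make each individual fit slower, but the number of subsets is what dominates as the predictor count grows.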

The fastest way to do this is mentioned in one of the comments. You should pre-specify your model based on theory/intuition/logic and come up with a set of variables that you hypothesize will be good predictors of your outcome.

The difference between the two extremes is that exhaustive search may leave you with a model that doesn't make sense, because it will use whatever variables it has access to, even if they are completely unrelated to your question of interest.

If, however, you don't want to specify a model and still want to use an automated technique to build the "best" model, a middle ground might be something like stepwise regression.

There are a few different ways of doing this (e.g. forward selection or backward elimination). In forward selection, for example, you add one variable at a time and test whether it helps. If the variable improves model fit (judged either by the significance of its regression coefficient or by the R2 of the model), you keep it and add another. If it doesn't aid prediction, you throw it away. Repeat this process until you've found your best predictors.
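
The forward selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. It ranks candidates by the R2 of the model rather than by coefficient significance. It stops after a fixed number of variables (three, as in the question) instead of using a stopping rule. It fits with NumPy's `np.linalg.lstsq` instead of the sklearn model used in the question. The names `r_squared` and `forward_select` are hypothetical helpers, and the data is synthetic.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def forward_select(X, y, n_keep=3):
    """Greedily add the column that gives the highest R-squared, n_keep times."""
    selected = []
    remaining = list(range(X.shape[1]))
    best_r2 = 0.0
    for _ in range(n_keep):
        # Try each remaining column alongside the ones already kept.
        scores = [(r_squared(X[:, selected + [j]], y), j) for j in remaining]
        best_r2, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected, best_r2


# Synthetic stand-in for the question's data (an assumption for illustration):
# 50 observations, 100 predictors, with the target driven by columns 3, 17, 42.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))
y = (1.0 + 2.0 * X[:, 3] - 1.5 * X[:, 17] + 0.5 * X[:, 42]
     + rng.normal(scale=0.1, size=50))

selected, r2 = forward_select(X, y)
print(sorted(selected), round(r2, 3))
```

With 100 predictors this fits 100 + 99 + 98 = 297 models. Because the search is greedy, it is not guaranteed to find the triple with the highest R2 that an exhaustive search would find.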

Simon