I could have asked a shorter question that focuses only on the core problem here, which is list permutations. But the reason I'm bringing statsmodels and pandas into the question is that there may exist specific tools for step-wise regression that have the flexibility to store the desired regression output, as I'm about to show below, while being much more efficient. At least I hope so.
Given a dataframe like the one below:
Code snippet 1:
# Imports
import pandas as pd
import numpy as np
import itertools
import statsmodels.api as sm
# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df_1 = df_1.set_index(rng)
print(df_1)
Screenshot 1:
I'd like to run several regression analyses on the dependent variable y using multiple combinations of the independent variables x1, x2 and x3. In other words, this is a step-wise regression analysis where y is tested against x1, and then against x2 and x3 consecutively. Then y is tested against the set of x1 AND x2, and so on, like this:
- ['y', ['x1']]
- ['y', ['x2']]
- ['y', ['x3']]
- ['y', ['x1', 'x2']]
- ['y', ['x1', 'x2', 'x3']]
My inefficient approach:
In the next two snippets below, I'm able to do exactly this by hardcoding the execution sequence using a list of lists.
Here are the subsets of listVars:
Code snippet 2:
listExec = [[listVars[0], listVars[1:2]],
            [listVars[0], listVars[2:3]],
            [listVars[0], listVars[3:4]],
            [listVars[0], listVars[1:3]],
            [listVars[0], listVars[1:4]]]
for l in listExec:
    print(l)
Screenshot 2:
With listExec I can set up a procedure for regression analysis and store a bunch of results (rsquared or the entire model output model.summary()) in a list like this:
Code snippet 3:
allResults = []
for l in listExec:
    # l[0] is the dependent variable, l[1] the list of independent variables
    x = sm.add_constant(df_1[l[1]])
    model = sm.OLS(df_1[l[0]], x).fit()
    result = model.rsquared
    allResults.append(result)
print(allResults)
Screenshot 3:
And this is pretty awesome, but horribly inefficient for longer lists of variables.
My attempt at list combinations:
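To make the hardcoding problem concrete, here is a minimal sketch of how the same execution sequence could be built programmatically. It assumes the intended pattern is "each independent variable on its own, then cumulative sets x1..xk", which is only my reading of the list above; the helper name build_exec_list is hypothetical and not part of pandas or statsmodels.

```python
def build_exec_list(variables):
    """Return [dependent, predictors] pairs (hypothetical helper).

    Assumes the first entry is the dependent variable and the rest
    are independent variables.
    """
    dependent, predictors = variables[0], variables[1:]
    # Each independent variable on its own: ['x1'], ['x2'], ['x3']
    singles = [[dependent, [p]] for p in predictors]
    # Cumulative sets of size 2..n: ['x1', 'x2'], ['x1', 'x2', 'x3']
    cumulative = [[dependent, predictors[:k]]
                  for k in range(2, len(predictors) + 1)]
    return singles + cumulative


listVars = ['y', 'x1', 'x2', 'x3']
listExec = build_exec_list(listVars)
for l in listExec:
    print(l)
```

For the four variables used here, this yields the same five entries as the hardcoded listExec, so it could feed the regression loop unchanged.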
Following the suggestions from How to generate all permutations of a list in Python and Convert a list of tuples to a list of lists, I'm able to generate every permutation of ALL variables like this:
Code snippet 4:
allTuples = list(itertools.permutations(listVars))
allCombos = [list(elem) for elem in allTuples]
Screenshot 4:
And that's a lot of fun, but does not give me the stepwise approach that I'm after. Anyway, I hope some of you find this interesting.
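The mismatch seems to be that itertools.permutations reorders all four names (including y) instead of picking subsets of the independent variables. A minimal sketch of the subset version, using itertools.combinations, might look like the following. It assumes that every non-empty subset of the independent variables is wanted, which is broader than the five-entry list above; the helper name all_predictor_subsets is hypothetical.

```python
import itertools


def all_predictor_subsets(variables):
    """Return [dependent, predictors] for every non-empty predictor subset.

    Hypothetical helper; assumes the first entry is the dependent variable.
    """
    dependent, predictors = variables[0], variables[1:]
    subsets = []
    for size in range(1, len(predictors) + 1):
        # combinations() ignores order, so ('x1', 'x2') appears only once
        for combo in itertools.combinations(predictors, size):
            subsets.append([dependent, list(combo)])
    return subsets


allSubsets = all_predictor_subsets(['y', 'x1', 'x2', 'x3'])
for s in allSubsets:
    print(s)
```

With n independent variables this produces 2**n - 1 entries (7 for three variables), so it grows quickly for longer variable lists.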
Thank you for any suggestions!