I have tried to run the code below in Python 3.7. It loops through every combination of columns in the dataframe 'Rawdata', fits a regression model to each subset using the statsmodels library, and returns the best one. The code does not throw any errors until I run the last line, best_subset(X, Y), which returns: "IndexingError: Too many indexers".

Any idea what's wrong/how to fix?

Would be great if someone can help! Thanks

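#Imports
import numpy as np
import pandas as pd
import statsmodels.api as sm
from itertools import chain, combinations

#Data
Rawdata = pd.read_csv(r'C:\Users\Lucas\Documents\sample.csv')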

#Main code
def best_subset(X, Y):
    n_features = X.shape[1]
    subsets = chain.from_iterable(combinations(range(n_features), k+1) for k in range(n_features))
    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        lin_reg = sm.OLS(Y, X.iloc[:, subset]).fit()
        score = lin_reg.rsquared_adj
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

#Define data inputs and call code above
X = Rawdata.iloc[:, 1:10]
Y = Rawdata.iloc[:, 0]

#To return best model
best_subset(X, Y)

1 Answer

Your loop variable subset is a tuple of column positions whose length ranges from 1 to n_features. If, for example, subset is (0, 1), your regression reads as

lin_reg = sm.OLS(Y, X.iloc[:, (0, 1)]).fit()

With .iloc, pandas does not accept a tuple as the indexer for a single axis, which is why it raises "Too many indexers". One solution is to convert subset from a tuple to a list:

for subset in subsets:
    subset = list(subset)
    lin_reg = sm.OLS(Y, X.iloc[:, subset]).fit()
above_c_level
  • 3,579
  • 3
  • 22
  • 37