
I am really new to Python so I don't know a lot of basics, but I have a college report that has to be done in Python and I'm struggling with figuring out how to resolve the issues in my code.

First I created my training data for X and y, then converted it into a pandas DataFrame so I could call ols from statsmodels on it for my initial model. Now I want to use RFE to reduce my model, and I'm starting with RFECV so I can determine how many features RFE should select. But every time I run the code, the call to rfecv.fit() raises an error.

Here is my code:

'''

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe = make_pipeline(StandardScaler())
pipe.fit(X_train, y_train)

#recombine training dataset in order to call ols
X_train_df = pd.DataFrame(X_train)
y_train_df = pd.DataFrame(y_train)
traindata = pd.concat([X_train_df.reset_index(drop=True), y_train_df.reset_index(drop=True)], axis=1)


#create first linear model
from statsmodels.formula.api import ols
model1 = ols('Tenure ~ Population + Children + Age + Income + Outage_sec_perweek + Email + Contacts + Yearly_equip_failure + MonthlyCharge + Bandwidth_GB_Year + Area_Suburban + Area_Urban + Marital_Married + Marital_Never_Married + Marital_Separated + Marital_Widowed + Gender_Male + Gender_Nonbinary + Churn_Yes + Contract_One_Year + Contract_Two_Year', data=traindata).fit()
print(model1.params)

#RFECV to determine number of variables to include for the optimal model
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

svc = SVC(kernel="linear")
rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold, scoring = "accuracy")
rfecv.fit(X_train_df, y_train_df)

'''

The output error looks like this: TypeError: Singleton array array(None, dtype=object) cannot be considered a valid collection.

Any help or resources would be really appreciated! Thanks

1 Answer


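You need to pass cv = StratifiedKFold() instead of cv = StratifiedKFold. The cv argument expects an integer, a splitter instance, or an iterable of splits, not the class itself, so the line below will work: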

rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold(), scoring = "accuracy")

Or if you want 10 folds (default is 5):

rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold(n_splits=10), scoring = "accuracy")

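The parentheses matter: without them, StratifiedKFold refers to the class itself, while with them you call the class and get a splitter instance that RFECV can use. Other posts cover the difference between having and not having the parentheses in more detail.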
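To make the class-versus-instance point concrete, here is a minimal sketch. It uses synthetic data from make_classification as a stand-in for your dataset (X_demo, y_demo, cv_class and cv_instance are hypothetical names, not from your code), so treat it as an illustration rather than a drop-in replacement:

```python
import inspect

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic classification data standing in for the real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=3, random_state=0
)

# Without parentheses the name refers to the class itself ...
cv_class = StratifiedKFold
# ... with parentheses you get an instance, i.e. a configured splitter object.
cv_instance = StratifiedKFold(n_splits=5)

print(inspect.isclass(cv_class))     # True
print(inspect.isclass(cv_instance))  # False

# Passing the instance lets RFECV generate the train/validation splits.
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=cv_instance, scoring="accuracy")
rfecv.fit(X_demo, y_demo)

# Number of features RFECV selected on the synthetic data.
print(rfecv.n_features_)
```

After fitting, rfecv.n_features_ holds the number of features selected and rfecv.support_ is a boolean mask over the columns, which you could then use to decide how many features to request from RFE.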

StupidWolf