
I wrote a function to find the best combination of features from a given dataframe, scored by f1 and AUC, using LogisticRegression. The problem is that when I try to pass a list of feature combinations, built with itertools combinations, LogisticRegression doesn't recognize each combination as its own X variable/dataframe.

I'm starting with a dataframe of 10 feature columns and 10k rows. When I run the code below I get "ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input".

from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

def find_best_combination(X, y):
    # initialize tracking variables
    best_f1 = 0
    best_auc = 0
    best_variables = []

    # try every combination of feature columns
    for i in range(1, X.shape[1]):
        for combination in combinations(X.columns, i):
            X_subset = X[list(combination)]
            logreg = LogisticRegression()
            logreg.fit(X_subset, y)
            y_pred = logreg.predict(X_subset)

            f1 = f1_score(y, y_pred)
            auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
            # evaluate performance of the current combination
            if f1 > best_f1 and auc > best_auc:
                best_f1 = f1
                best_auc = auc
                best_variables = combination
    return best_variables, best_f1, best_auc

and the error

C:\Users\katurner\Anaconda3\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- IBE1273_01_11.0
- IBE1273_01_6.0
- IBE7808
- IBE8439_2.0
- IBE8557_7.0
- ...

  warnings.warn(message, FutureWarning)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\2\ipykernel_15932\895415673.py in <module>
----> 1 best_combo = ml.find_best_combination(X,lg_y)
      2 best_combo

~\Documents\Arcadia\modeling_library.py in find_best_combination(X, y)
    176             # print(y_test)
    177             f1 = f1_score(y, y_pred)
--> 178             auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
    179             # evaluate performance on current combination of variables
    180             if f1> best_f1 and auc > best_auc:

~\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py in predict_proba(self, X)
   1309         )
   1310         if ovr:
-> 1311             return super()._predict_proba_lr(X)
   1312         else:
   1313             decision = self.decision_function(X)

~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in _predict_proba_lr(self, X)
    459         multiclass is handled by normalizing that over all classes.
    460         """
--> 461         prob = self.decision_function(X)
    462         expit(prob, out=prob)
    463         if prob.ndim == 1:

~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
    427         check_is_fitted(self)
    428 
--> 429         X = self._validate_data(X, accept_sparse="csr", reset=False)
    430         scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
    431         return scores.ravel() if scores.shape[1] == 1 else scores

~\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    598 
    599         if not no_val_X and check_params.get("ensure_2d", True):
--> 600             self._check_n_features(X, reset=reset)
    601 
    602         return out

~\Anaconda3\lib\site-packages\sklearn\base.py in _check_n_features(self, X, reset)
    398 
    399         if n_features != self.n_features_in_:
--> 400             raise ValueError(
    401                 f"X has {n_features} features, but {self.__class__.__name__} "
    402                 f"is expecting {self.n_features_in_} features as input."

ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input.

I'm expecting the function to return a combination of best_variables and the associated best_f1 and best_auc.

I've also tried running the function using train_test_split. When I add train_test_split to the code below, the function does run but returns "[], 0, 0" for best_variables, best_f1, best_auc.

from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def find_best_combination(X, y):
    # initialize tracking variables
    best_f1 = 0
    best_auc = 0
    best_variables = []
    # try every combination of feature columns
    for i in range(1, X.shape[1]):
        for combination in combinations(X.columns, i):
            X_subset = X[list(combination)]
            X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, stratify=y, random_state=73)
            logreg = LogisticRegression()
            logreg.fit(X_train, y_train)
            y_pred = logreg.predict(X_test)
            f1 = f1_score(y_test, y_pred)
            auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
            # evaluate performance of the current combination
            if f1 > best_f1 and auc > best_auc:
                best_f1 = f1
                best_auc = auc
                best_variables = combination
    return best_variables, best_f1, best_auc

I'm not sure what's going on under the hood of train_test_split that lets the function iterate through without erroring like before.
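My current understanding, sketched with the same made-up 10-column frame as an assumption: train_test_split only slices rows, so X_test keeps exactly the columns of X_subset that the model was fit on, and every predict/predict_proba call sees a matching feature count:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# made-up data: 100 rows, 10 feature columns
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 10)), columns=[f"f{i}" for i in range(10)])
y = (X["f0"] + rng.normal(size=100) > 0).astype(int)

# one hypothetical 2-feature combination
X_subset = X[["f0", "f1"]]
X_train, X_test, y_train, y_test = train_test_split(
    X_subset, y, test_size=0.2, stratify=y, random_state=73)

logreg = LogisticRegression().fit(X_train, y_train)
# X_test has the same 2 columns the model was fit on, so no shape error
proba = logreg.predict_proba(X_test)
print(proba.shape)  # (20, 2): 20 test rows, 2 class-probability columns
```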

I hope this explains it enough. Thanks in advance for any help.

  • First, you might need to check [feature importance](https://stackoverflow.com/a/34052747/10452700) and dimension reduction keep most important ones and use their combinations. See [this](https://stackoverflow.com/questions/66750706/sklearn-important-features-error-when-using-logistic-regression) – Mario Jan 17 '23 at 23:57
  • Thanks but I've already created a function to determine statistically significant features. That is where the 10 features that I'm using came from. – Kyle Turner Jan 18 '23 at 16:24

0 Answers