
I am working on an academic project where the aim is to understand which features predict a binary outcome, so I have used a logistic model. However, there is a problem with collinearity, as several of the features are highly correlated. I was planning to implement L1 regularisation (Lasso) to find the best features for the model. I have read up on this but was looking for some advice on the best implementation in Python.

My current code:


    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from statsmodels.api import Logit

    DM.shape  # design matrix shape (396, 7): 6 predictors plus an initial constant column
    scaler = StandardScaler()  # standard scaler from scikit-learn
    scaler.fit(DM[:, 1:])  # fit on the whole dataset, excluding the constant column
    ScaledDM = np.hstack((DM[:, 0].reshape(-1, 1), scaler.transform(DM[:, 1:])))

    AlphasToTest = np.arange(0.01, 10, 0.01)  # grid of alpha values to cross-validate over
    log_likelihood = np.zeros(len(AlphasToTest))

    for store, aa in enumerate(AlphasToTest):  # loop over potential alpha values
        # split into train and test for each iteration
        X_train, X_test, Y_train, Y_test = train_test_split(ScaledDM, Y, train_size=0.8)
        lasso_model = Logit(Y_train, X_train)
        lasso_result = lasso_model.fit_regularized(method='l1', alpha=aa, disp=0)

        # predicted probabilities on the test set, and the resulting test log-likelihood
        y_pred = lasso_result.predict(X_test)
        log_likelihood[store] = np.sum(Y_test * np.log(y_pred) + (1 - Y_test) * np.log(1 - y_pred))

I would then choose the alpha value whose model has the highest test log-likelihood.
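For reference, picking out the winning value from the arrays above might look like this (a one-liner, assuming the loop has completed):

    best_alpha = AlphasToTest[np.argmax(log_likelihood)]  # alpha with the highest test log-likelihood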

Does this seem an appropriate implementation? Is it appropriate to apply the scaler to the whole dataset before subdividing into training and test sets? Is this a good way to determine the best value of alpha, and hence the best features for the model? Thanks for your help.

N.B. Following helpful suggestions, the code has been updated as follows:


    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

    X_train, X_test, Y_train, Y_test = train_test_split(DM, Y, train_size=0.8)
    AlphasToTest = np.arange(0.01, 10, 0.01)

    # fit the scaler on the training data only, dropping the constant column
    # (scikit-learn fits its own intercept)
    scaler = StandardScaler()
    scaler.fit(X_train[:, 1:])
    ScaledDM = scaler.transform(X_train[:, 1:])

    lasso_model = LogisticRegressionCV(
        penalty='l1',
        solver='saga',
        Cs=AlphasToTest,  # note: C is the inverse of the regularisation strength alpha
        cv=5,
        scoring='neg_log_loss',
        max_iter=1000
    )
    lasso_model.fit(ScaledDM, Y_train.ravel())
    best_C = lasso_model.C_[0]  # best C value chosen by cross-validation

    # refit on the training data with the chosen penalty
    lasso_model_best = LogisticRegression(
        penalty='l1',
        solver='saga',
        C=best_C,
        max_iter=1000
    )
    lasso_model_best.fit(ScaledDM, Y_train.ravel())

    # scale the test data using the same scaler and evaluate the final model on it
    ScaledDM_test = scaler.transform(X_test[:, 1:])
    test_probs = lasso_model_best.predict_proba(ScaledDM_test)

    # access the estimated coefficients (scikit-learn does not report p-values)
    coef = lasso_model_best.coef_
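As a first pass at interpretation, one minimal option is to list the features the L1 penalty retained, i.e. those with nonzero coefficients. This sketch assumes the names of the 6 predictors are held in a list such as predictor_names, which is not defined above:

    # predictor_names is a hypothetical list of the 6 predictor column names
    retained = np.flatnonzero(coef.ravel())  # indices of predictors the lasso kept
    for i in retained:
        print(predictor_names[i], coef.ravel()[i])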

However, I now have the issue of how to get p-values, or more generally the best way to interpret the output of lasso_model_best.

SeanC

1 Answer


It is not appropriate to fit a single StandardScaler on the whole dataset before splitting into training and test sets: the scaler then carries the test set's mean and variance into training, leaking information about the data you later evaluate on. For more details you can read my answer to this question.

Regarding your code, you should use cross-validation to assess which value of alpha is best. When using only one train/test split, your parameter selection is sensitive to the way the data happens to be split, and you might select a model with poorer generalisation than you estimate. You can think of it as 'overfitting' on your train/test split.

You can use scikit-learn to automate most of this process, provided the Logit model you use is compatible with it (it must implement at least fit, predict and score methods for this use case, with the appropriate parameters; see more here). Or use scikit-learn's own L1-penalised logistic regression, which will be simpler. Then follow these steps:

  1. Create a Pipeline object to stack your StandardScaler and lasso model into one estimator - this way both the StandardScaler and the model will be fit on the training data only.
  2. Use GridSearchCV with 'neg_log_loss' as the scoring and your grid of alpha values as the param_grid. The log loss is the negative log-likelihood, and since GridSearchCV maximises its score, using the negative log loss means the search maximises your log-likelihood. By default the parameter selection is done in a cross-validated manner with 5 folds, which is more robust.

You can also implement this strategy by hand, of course. A sketch of the scikit-learn version follows.
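For illustration, a minimal sketch of steps 1 and 2, using scikit-learn's L1-penalised LogisticRegression in place of the statsmodels Logit. The names X, y and alphas are placeholders for your own data and grid, and note that scikit-learn's C parameter is the inverse of the regularisation strength alpha:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # X: (n_samples, n_features) predictor matrix WITHOUT a constant column
    # y: binary outcome vector - both are placeholders for your own data
    pipe = Pipeline([
        ('scaler', StandardScaler()),  # refit on the training folds only
        ('model', LogisticRegression(penalty='l1', solver='saga', max_iter=1000)),
    ])

    alphas = np.arange(0.01, 10, 0.01)
    param_grid = {'model__C': 1.0 / alphas}  # C is the inverse of alpha

    search = GridSearchCV(pipe, param_grid, scoring='neg_log_loss', cv=5)
    search.fit(X, y)
    print(search.best_params_)  # best C, i.e. 1 / best alpha
    print(search.best_estimator_.named_steps['model'].coef_)

Because the scaler sits inside the pipeline, each cross-validation fold rescales using only its own training portion, which avoids the leakage discussed above.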

A Co
  • Thanks for the suggestions, that was very helpful. I have updated the code (see above); I think it now works well and incorporates your suggestions. However, I was planning to extract p-values from 'lasso_model_best' so the coefficients can be interpreted, and when using scikit-learn this isn't possible. Do you know how I could do this, or another appropriate way to assess which predictors of 'lasso_model_best' are worthwhile, given that I have now established the best alpha on the training set? Thanks again – SeanC Jul 17 '23 at 20:00
  • You can wrap the original model you wanted to use in a custom class making it compatible with the `Pipeline` and `GridSearchCV` estimators from scikit-learn. What is the model you wanted to use originally (which package)? – A Co Jul 19 '23 at 12:10
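For illustration, a minimal sketch of such a wrapper, assuming the original model is statsmodels' L1-penalised Logit; the class name and internals are hypothetical, not part of either library:

    import numpy as np
    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, ClassifierMixin

    class L1LogitWrapper(BaseEstimator, ClassifierMixin):
        """Hypothetical wrapper making statsmodels' L1-penalised Logit
        usable inside scikit-learn's Pipeline and GridSearchCV."""

        def __init__(self, alpha=1.0):
            self.alpha = alpha

        def fit(self, X, y):
            # statsmodels does not add an intercept automatically
            X = sm.add_constant(X, has_constant='add')
            self.result_ = sm.Logit(y, X).fit_regularized(
                method='l1', alpha=self.alpha, disp=0)
            self.classes_ = np.unique(y)
            return self

        def predict_proba(self, X):
            X = sm.add_constant(X, has_constant='add')
            p = self.result_.predict(X)  # P(y = 1)
            return np.column_stack([1 - p, p])

        def predict(self, X):
            return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)

    # score() (mean accuracy) is inherited from ClassifierMixin, so the
    # wrapper satisfies the fit / predict / score interface mentioned above

One advantage of going back to statsmodels here is that the fitted result_ object keeps the usual statsmodels summary output, which is one route to the p-values asked about above, with the standard caveat that inference performed after variable selection is approximate.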