I am working on an academic project whose aim is to understand which features predict a binary outcome, so I have used a logistic regression model. However, there is a collinearity problem, as several of the features are highly correlated. I was planning to implement L1 regularisation (Lasso) to select the best features for the model. I have read up on this but am looking for advice on the best implementation in Python.
My current code:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

DM.shape  # design matrix shape (396, 7) - 6 predictors, with an initial column holding just a constant term
scaler = StandardScaler()  # standard scaler from sklearn
scaler.fit(DM[:, 1:])  # apply to the whole dataset (i.e. before splitting)
ScaledDM = np.hstack((DM[:, 0].reshape(-1, 1), scaler.transform(DM[:, 1:])))
AlphasToTest = np.arange(0.01, 10, 0.01)  # grid of candidate alpha values for cross-validation
log_likelihood = np.zeros(len(AlphasToTest))
for store, aa in enumerate(AlphasToTest):  # loop over candidate alpha values
    # fresh train/test split for each iteration
    X_train, X_test, Y_train, Y_test = train_test_split(ScaledDM, Y, train_size=0.8)
    lasso_model = sm.Logit(Y_train, X_train)
    lasso_result = lasso_model.fit_regularized(method='l1', alpha=aa, disp=0)
    # predicted probabilities on the test set
    y_pred = lasso_result.predict(X_test)
    # held-out Bernoulli log-likelihood for this alpha
    log_likelihood[store] = np.sum(Y_test * np.log(y_pred) + (1 - Y_test) * np.log(1 - y_pred))
I would then choose the alpha value whose model achieves the highest held-out log-likelihood.
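(As an aside, I noticed the hand-rolled log-likelihood returns -inf if a predicted probability hits exactly 0 or 1, so in my own runs I clip the probabilities first; the eps value below is an arbitrary choice of mine:)

eps = 1e-12  # arbitrary small constant to keep probabilities strictly inside (0, 1)
p_safe = np.clip(y_pred, eps, 1 - eps)
log_likelihood[store] = np.sum(Y_test * np.log(p_safe) + (1 - Y_test) * np.log(1 - p_safe))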
Does this seem to be an appropriate implementation? Is it appropriate to apply the scaler to the whole dataset before subdividing it into training and test sets? Is this a good way to determine the best value of alpha, and hence the best features for the model? Thanks for your help.
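For reference, one pattern I have come across is to keep the scaling inside the cross-validation with a Pipeline, so that each fold's scaler is fit only on that fold's training portion. A minimal sketch of what I think this would look like with the standard sklearn API (note that Cs here are inverse regularization strengths rather than alphas, and Cs=20 just requests a default log-spaced grid):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

# the scaler is refit on the training portion of every CV fold,
# so no information from the held-out fold leaks into the scaling
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('lasso', LogisticRegressionCV(penalty='l1', solver='saga',
                                   Cs=20, cv=5, scoring='neg_log_loss',
                                   max_iter=1000)),
])
pipe.fit(DM[:, 1:], Y.ravel())  # drop the constant column; sklearn fits its own intercept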
N.B. Following helpful suggestions, the code has been updated as follows:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X_train, X_test, Y_train, Y_test = train_test_split(DM, Y, train_size=0.8)
CsToTest = np.arange(0.01, 10, 0.01)  # grid of C values (sklearn's C is the inverse of the regularization strength alpha)
scaler = StandardScaler()
scaler.fit(X_train[:, 1:])  # fit the scaler on the training data only
ScaledDM_train = scaler.transform(X_train[:, 1:])  # drop the constant column; sklearn fits its own intercept
lasso_model = LogisticRegressionCV(
    penalty='l1',
    solver='saga',
    Cs=CsToTest,
    cv=5,
    scoring='neg_log_loss',
    max_iter=1000
)
lasso_model.fit(ScaledDM_train, Y_train.ravel())
best_C = lasso_model.C_[0]  # best C value chosen by cross-validation
# Refit on the training data using the best C value
lasso_model_best = LogisticRegression(
    penalty='l1',
    solver='saga',
    C=best_C,
    max_iter=1000
)
lasso_model_best.fit(ScaledDM_train, Y_train.ravel())
# Scale the test data with the same scaler and evaluate on it
ScaledDM_test = scaler.transform(X_test[:, 1:])
test_probs = lasso_model_best.predict_proba(ScaledDM_test)[:, 1]
# Access the estimated coefficients (sklearn does not report p-values)
coef = lasso_model_best.coef_
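For interpreting the fit, my understanding is that the predictors whose coefficients are shrunk exactly to zero are the ones the Lasso has dropped, so the surviving features can be read off directly (the feature_names list below is a hypothetical stand-in for my actual column names):

feature_names = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']  # hypothetical stand-in for the real predictor names
selected = np.flatnonzero(coef.ravel())  # indices of nonzero Lasso coefficients
print([feature_names[i] for i in selected])  # the features the Lasso retained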
However, I now have the issue of how to obtain p-values, and more generally the best way to interpret the output of lasso_model_best.
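One option I am considering, based on what I have read, is to refit a plain (unpenalised) logistic regression on just the Lasso-selected features with statsmodels, which does report standard errors and p-values, though I gather such post-selection p-values should be treated with caution because the same data chose the features. A sketch, reusing the selected indices from above:

import statsmodels.api as sm

X_sel = ScaledDM_train[:, selected]  # keep only the Lasso-selected predictors
refit = sm.Logit(Y_train.ravel(), sm.add_constant(X_sel)).fit(disp=0)
print(refit.summary())  # coefficients, standard errors and p-values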