Variable selection in big data

Question

I am trying to build a regression model on a large data set with 220 variables. All 220 variables are binary (0 or 1). Some of them are correlated, though not highly. For some variables, 60% or more of the values are zero; these zeros are genuine values, not placeholders for missing data.
My main goal is to identify the most important variables. What is the best approach for variable selection?

score 0 · Answer 1 · answered Apr 25 '23 at 20:59

To find the most important variables, you can use several different variable selection methods. You can use each one on its own, or combine their outputs, for example by keeping the variables that more than one method flags. Some commonly used methods are:

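LASSO Regression (L1 penalty; sets some coefficients exactly to zero, so it selects variables directly)
Ridge Regression (L2 penalty; shrinks coefficients but does not set them exactly to zero, so on its own it ranks variables rather than selecting a subset; it is mainly useful for stabilizing estimates when variables are correlated)
ElasticNet Regression (mix of L1 and L2 penalties; still produces zero coefficients and tends to handle groups of correlated variables better than LASSO alone)
Recursive Feature Elimination (repeatedly fits a model and drops the weakest variables)
Tree-based Models such as Random Forest, Gradient Boosting, etc. (rank variables by feature importance)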
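
As a rough illustration of combining two of the methods listed above, here is a minimal sketch in Python using scikit-learn. Everything about the data is an assumption made for the example: it is synthetic, with 1000 rows, 220 binary columns that are about 60% zeros, and a continuous target that depends only on the first five columns. The cutoff of 10 top variables for the random forest is also arbitrary. With real data you would replace `X` and `y` with your own arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 220 binary predictors,
# each equal to 1 with probability 0.4, so roughly 60% of values are zero.
n_samples, n_features = 1000, 220
X = (rng.random((n_samples, n_features)) < 0.4).astype(float)

# Assumption for the demo: only the first 5 columns affect the target.
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# LASSO with the penalty strength chosen by 5-fold cross-validation.
# Variables with a nonzero coefficient are the ones LASSO keeps.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_selected = np.flatnonzero(lasso.coef_)

# Random forest: rank variables by impurity-based importance and
# keep the 10 highest-ranked ones (the cutoff is arbitrary).
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
forest_top = np.argsort(forest.feature_importances_)[::-1][:10]

# Variables flagged by both methods.
consensus = sorted(set(lasso_selected.tolist()) & set(forest_top.tolist()))

print("LASSO kept", len(lasso_selected), "variables")
print("Random forest top 10:", sorted(forest_top.tolist()))
print("Flagged by both:", consensus)
```

Because all predictors here are binary, no rescaling is done before fitting the LASSO; if your variables have very different frequencies of ones, you may still want to compare results with and without standardization. Impurity-based importances can also be unreliable when variables are correlated, so treating the overlap between methods as a shortlist rather than a final answer is probably the safer reading.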

Additionally, if you have some domain knowledge about the data, you can eliminate variables that are less relevant to the target variable.
