As part of an automatic feature selection pipeline, I want to automatically remove features that are highly correlated, based on their variance inflation factor (VIF). Below is an example df (the data I actually use often has more than 200 features), a VIF calculation function, and a while loop that is meant to continue until all VIF values are below 100.
The while loop stops if vif_data['VIF'].iloc[1] is inf. In my example only a small number of rows have inf values, but the rows below them contain features with VIF values above 100. Those features are never removed, because the loop stops as soon as the first row's VIF is inf rather than a finite number.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)
def compute_vif(df):
    # Keep numeric columns only and compute one VIF per column
    df = df.select_dtypes(include='number').apply(pd.to_numeric)
    features = df.columns.to_list()
    vif_data = pd.DataFrame()
    vif_data['feature'] = features
    vif_data['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    # Highest VIF first
    return vif_data.sort_values(by=['VIF'], ascending=False)
vif_data = compute_vif(df)
multicollinearity = True
while multicollinearity:
    highest_vif_feature = vif_data['feature'].values[-1]
    print('Removing: ', highest_vif_feature)
    df = df.drop(highest_vif_feature, axis=1)
    vif_data = compute_vif(df)
    multicollinearity = False if vif_data['VIF'].iloc[1] > 100 else True
How can I fix the last line of the while loop to address this?
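For reference, here is a minimal sketch of the behaviour I am after, not a definitive implementation. Since inf > 100 is already True in Python, an infinite VIF may not need a special case if the loop keeps running while the largest VIF exceeds the threshold and drops the feature in the row holding that largest value. The names drop_high_vif and threshold are hypothetical. compute_vif is passed in and is assumed to return 'feature' and 'VIF' columns like the function above. Treating NaN as infinite is also an assumption.

```python
import math

import pandas as pd


def drop_high_vif(df, compute_vif, threshold=100.0):
    """Drop the highest-VIF feature until no VIF exceeds threshold.

    compute_vif is assumed to return a DataFrame with 'feature' and
    'VIF' columns, one row per column of df.
    """
    df = df.copy()
    # Stop when one feature is left: VIF is not meaningful for a single column.
    while df.shape[1] > 1:
        vif_data = compute_vif(df)
        # Assumption: a NaN VIF is treated like an infinite one.
        vif = vif_data['VIF'].fillna(math.inf)
        worst = vif.idxmax()  # index label of the largest VIF
        # inf > threshold is True, so infinite values are handled here too.
        if not vif.loc[worst] > threshold:
            break
        feature = vif_data.loc[worst, 'feature']
        print('Removing:', feature)
        df = df.drop(columns=feature)
    return df
```

It would be called as reduced = drop_high_vif(df, compute_vif), which leaves the original df unchanged. Using idxmax avoids depending on how vif_data happens to be sorted.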