Remove highly correlated column in numpy (without pandas)

Question

I have a 2D NumPy array:

[[1 3 4 2]
 [2 4 6 4]
 [-1 6 8 -2]
 [6 4 2 12]]

I want to remove highly correlated columns. The result should look like this:

[[1 3 4]
 [2 4 6]
 [-1 6 8]
 [6 4 2]]

Column 4 is removed because it is highly correlated with column 1.

I can get the correlation matrix with:

np.corrcoef(numpy_array)

The question is: how do I drop the columns that have a high correlation?

I've searched for a solution but only found ones that use a pandas DataFrame. For some reason I don't want to use pandas; I want a solution that uses only NumPy.

You may want to take a look at this: https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on — Scott, Sep 06 '19 at 07:27
I want to remove highly correlated columns automatically, not manually column by column — Hjin, Sep 06 '19 at 07:40

score 1 · Accepted Answer · answered Sep 06 '19 at 08:09

We can use the helper function corr2_coeff to get the pairwise correlation values across all columns, and then pick out the pairs whose correlation is 1, i.e. the perfectly correlated columns.
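corr2_coeff is not a NumPy function and is not defined in this answer. Below is a minimal sketch of what such a helper might look like, assuming it returns the Pearson correlation between every row of its first argument and every row of its second (which is why the session below passes a.T). The body is an assumption for illustration, not necessarily the original implementation.

```python
import numpy as np

def corr2_coeff(A, B):
    """Pearson correlation between every row of A and every row of B.

    A has shape (m, k) and B has shape (n, k); the result has shape (m, n).
    Sketch only: assumed behavior of the helper used in this answer.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Center each row on its own mean
    A_c = A - A.mean(axis=1, keepdims=True)
    B_c = B - B.mean(axis=1, keepdims=True)
    # Sum of squared deviations for each row
    ss_A = (A_c ** 2).sum(axis=1)
    ss_B = (B_c ** 2).sum(axis=1)
    # Cross products divided by the product of the row norms
    return (A_c @ B_c.T) / np.sqrt(np.outer(ss_A, ss_B))

a = np.array([[ 1, 3, 4,  2],
              [ 2, 4, 6,  4],
              [-1, 6, 8, -2],
              [ 6, 4, 2, 12]])

# Transpose so that the columns of `a` become the rows being correlated
cor = corr2_coeff(a.T, a.T)
```

For the square case used here, np.corrcoef(a, rowvar=False) should give the same matrix, so the helper is optional if only one array is involved.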

The steps would look something like this:

In [47]: a # Input array
Out[47]: 
array([[ 1,  3,  4,  2],
       [ 2,  4,  6,  4],
       [-1,  6,  8, -2],
       [ 6,  4,  2, 12]])

# Get correlation values 
In [48]: cor = corr2_coeff(a.T,a.T)

In [49]: cor
Out[49]: 
array([[ 1.        , -0.44992127, -0.87705802,  1.        ],
       [-0.44992127,  1.        ,  0.71818485, -0.44992127],
       [-0.87705802,  0.71818485,  1.        , -0.87705802],
       [ 1.        , -0.44992127, -0.87705802,  1.        ]])

# Get the pairs of columns that are perfectly correlated
# (upper triangle only, so each pair is listed once)
In [53]: p = np.argwhere(np.triu(np.isclose(cor,1),1))

In [54]: p
Out[54]: array([[0, 3]])

# Delete the second column of each pair
In [51]: np.delete(a,p[:,1],axis=1)
Out[51]: 
array([[ 1,  3,  4],
       [ 2,  4,  6],
       [-1,  6,  8],
       [ 6,  4,  2]])
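The session above only catches correlations of exactly 1. If "highly correlated" should also cover values just below 1, or strong negative correlations, the same idea can be wrapped in a function with a cutoff. This is a sketch under assumptions: the name drop_correlated_columns and the default threshold of 0.95 are made up for illustration, and np.corrcoef with rowvar=False is used in place of corr2_coeff.

```python
import numpy as np

def drop_correlated_columns(a, threshold=0.95):
    """Drop every column whose |correlation| with an earlier column exceeds threshold.

    Hypothetical helper; the threshold default is an arbitrary choice.
    Returns the reduced array and the indices of the dropped columns.
    """
    # rowvar=False makes np.corrcoef treat columns, not rows, as variables
    corr = np.corrcoef(a, rowvar=False)
    # Keep only the strict upper triangle so each pair is checked once
    upper = np.triu(np.abs(corr) > threshold, k=1)
    # Column j is dropped if it is highly correlated with any column i < j
    to_drop = np.flatnonzero(upper.any(axis=0))
    return np.delete(a, to_drop, axis=1), to_drop

a = np.array([[ 1, 3, 4,  2],
              [ 2, 4, 6,  4],
              [-1, 6, 8, -2],
              [ 6, 4, 2, 12]])

reduced, dropped = drop_correlated_columns(a)
# For this sample array only column index 3 is dropped
```

Note that a column is dropped when it correlates with any earlier column, even one that is itself dropped, so this can remove slightly more than strictly necessary. Whether that matters depends on the data.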
