I have a 2D NumPy array:

[[1 3 4 2]
 [2 4 6 4]
 [-1 6 8 -2]
 [6 4 2 12]]

I want to remove highly correlated columns. The result should look like this:

[[1 3 4]
 [2 4 6]
 [-1 6 8]
 [6 4 2]]

Column 4 is removed because it is highly correlated with column 1.

I can get the correlation matrix with:

np.corrcoef(numpy_array)

The question is: how do I drop the columns that are highly correlated?

I've searched for a solution but only found ones that use a pandas DataFrame. For various reasons I don't want to use pandas; I want a solution that uses only NumPy.

desertnaut
Hjin

1 Answer

We can use corr2_coeff, a helper function that is not part of NumPy and has to be defined separately, to get the pairwise correlation values across all columns, and then pick out the pairs whose correlation value is 1, i.e. the perfectly correlated columns.
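Since corr2_coeff is not defined in this answer, here is a minimal sketch of what such a helper could look like. It assumes the helper computes the Pearson correlation between every row of its first argument and every row of its second; the body below is an illustrative reconstruction under that assumption, not necessarily the original definition.

```python
import numpy as np

def corr2_coeff(A, B):
    """Pearson correlation between every row of A and every row of B.

    Assumed behaviour: returns an array of shape (A.shape[0], B.shape[0]).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    # Center each row on its own mean
    A_c = A - A.mean(axis=1, keepdims=True)
    B_c = B - B.mean(axis=1, keepdims=True)

    # Row-wise sums of squared deviations
    ss_A = (A_c ** 2).sum(axis=1)
    ss_B = (B_c ** 2).sum(axis=1)

    # Covariance-like numerator divided by the product of the norms
    return (A_c @ B_c.T) / np.sqrt(np.outer(ss_A, ss_B))
```

Passing a.T for both arguments turns the columns of a into rows, so the result is the column-by-column correlation matrix.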

The steps would look something like this:

In [47]: a # Input array
Out[47]: 
array([[ 1,  3,  4,  2],
       [ 2,  4,  6,  4],
       [-1,  6,  8, -2],
       [ 6,  4,  2, 12]])

# Get correlation values 
In [48]: cor = corr2_coeff(a.T,a.T)

In [49]: cor
Out[49]: 
array([[ 1.        , -0.44992127, -0.87705802,  1.        ],
       [-0.44992127,  1.        ,  0.71818485, -0.44992127],
       [-0.87705802,  0.71818485,  1.        , -0.87705802],
       [ 1.        , -0.44992127, -0.87705802,  1.        ]])

# Get pairs, which are the ones that are forming perfect correlation
In [53]: p = np.argwhere(np.triu(np.isclose(corr2_coeff(a.T,a.T),1),1))

In [54]: p
Out[54]: array([[0, 3]])

# Delete those cols
In [51]: np.delete(a,p[:,1],axis=1)
Out[51]: 
array([[ 1,  3,  4],
       [ 2,  4,  6],
       [-1,  6,  8],
       [ 6,  4,  2]])
Divakar