
Hi, I built a correlation matrix and want to delete all labels and values at or below a threshold (0.2 in the example). I convert the matrix to a Series so I can iterate over the values, but I don't know how to delete the rows. The next step is to convert it back to a DataFrame. Maybe you know a better way.

Here is a minimal example:

import numpy as np
import pandas as pd  

data = np.random.rand(4,4)
df = pd.DataFrame(data, index = ['varname1', 'varname2', 'varname3', 'varname4'], 
                  columns = ['longname1', 'longname2', 'longname3', 'longname4'])

corr = abs(df.corr().stack())
corr = corr[corr.index.get_level_values(0) != corr.index.get_level_values(1)] #delete doubles

for i in range(len(corr.keys())):
    if corr[i] <= 0.2:
        corr = corr.drop(corr[i])  # how can I delete the rows?
credenco

2 Answers

2

You can combine a second mask with & (element-wise AND) and filter by boolean indexing. To reshape the result back into a DataFrame, add Series.unstack:

np.random.seed(2020)
data = np.random.rand(4,4)
df = pd.DataFrame(data, index = ['varname1', 'varname2', 'varname3', 'varname4'], 
                  columns = ['longname1', 'longname2', 'longname3', 'longname4'])

print (df)
          longname1  longname2  longname3  longname4
varname1   0.986277   0.873392   0.509746   0.271836
varname2   0.336919   0.216954   0.276477   0.343316
varname3   0.862159   0.156700   0.140887   0.757080
varname4   0.736325   0.355663   0.341093   0.666803

corr = df.corr().stack().abs()
m1 = corr.index.get_level_values(0) != corr.index.get_level_values(1)
m2 = corr > 0.2
corr = corr[m1 & m2].unstack()
print (corr)
           longname1  longname2  longname3  longname4
longname1        NaN   0.584300   0.326267        NaN
longname2   0.584300        NaN   0.937580   0.641093
longname3   0.326267   0.937580        NaN   0.720851
longname4        NaN   0.641093   0.720851        NaN
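
To make the role of m1 & m2 concrete, here is a minimal sketch on a small hypothetical Series (the values and the labels a to d are made up for illustration and are not part of the data above). The & operator combines two boolean masks element by element, and indexing with the result keeps only the entries where both masks are True:

```python
import pandas as pd

# Hypothetical values, only for illustration
s = pd.Series([0.9, 0.1, 0.5, 0.05], index=["a", "b", "c", "d"])

# First mask: some arbitrary condition on the labels (here: label is not "c")
m1 = s.index != "c"   # boolean NumPy array: True, True, False, True
# Second mask: value above the threshold
m2 = s > 0.2          # boolean Series: True, False, True, False

both = m1 & m2        # element-wise AND: True only where both masks are True
kept = s[both]        # boolean indexing keeps the entries marked True
print(kept)
```

Only the entry that passes both conditions survives. In the answer above, m1 plays the role of "not on the diagonal" and m2 the role of "above the threshold".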

Another idea is to replace the values at or below the threshold with NaN using DataFrame.where, and then set the diagonal values to NaN with np.fill_diagonal:

df1 = df.corr().abs()
df1 = df1.where(df1 > 0.2)
np.fill_diagonal(df1.values, np.nan)
print (df1)
           longname1  longname2  longname3  longname4
longname1        NaN   0.584300   0.326267        NaN
longname2   0.584300        NaN   0.937580   0.641093
longname3   0.326267   0.937580        NaN   0.720851
longname4        NaN   0.641093   0.720851        NaN
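
np.fill_diagonal writes into df1.values in place, which relies on .values handing back a writable view of the DataFrame. That may not hold in every pandas version or configuration (this is an assumption on my side, not something tested across versions). A possible variant that avoids in-place writes is to fold the diagonal into the where condition, using a boolean identity matrix from np.eye. A minimal sketch with the same seeded data:

```python
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.rand(4, 4),
                  columns=['longname1', 'longname2', 'longname3', 'longname4'])

df1 = df.corr().abs()

# Boolean identity matrix: True on the diagonal, False elsewhere
diag = np.eye(len(df1), dtype=bool)

# Keep values that are above 0.2 AND off the diagonal; the rest become NaN
df1 = df1.where((df1 > 0.2) & ~diag)
print(df1)
```

This builds a new DataFrame instead of modifying the underlying array, so it does not depend on whether .values is writable.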
jezrael
  • Thanks, this is exactly what I am looking for. Can you please explain what [m1 & m2] is doing? Is it comparing the two Series? Can you please give me a link to the documentation? :) – credenco Jan 31 '20 at 12:28
  • @credenco - link for boolean indexing is [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) – jezrael Jan 31 '20 at 12:34

You have almost got it. Simply change the last part of your code from

for i in range(len(corr.keys())):
    if corr[i] <= 0.2:
        corr = corr.drop(corr[i]) # how can i delete the raws

to


corr = corr[corr > 0.2]

and you get only the rows with values above 0.2.
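
For context on why the loop does not work: Series.drop expects index labels, while corr[i] is a value, so the call looks for a label that does not exist. Below is a minimal sketch that compares boolean indexing with a label-based drop and then reshapes the result with unstack. It uses the seeded data from the other answer, and the names filtered, dropped and result are only illustrative:

```python
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.rand(4, 4),
                  columns=['longname1', 'longname2', 'longname3', 'longname4'])

corr = df.corr().stack().abs()
# Remove the diagonal (a variable paired with itself)
corr = corr[corr.index.get_level_values(0) != corr.index.get_level_values(1)]

# Variant 1: boolean indexing keeps the values above the threshold
filtered = corr[corr > 0.2]

# Variant 2: drop by label, passing the index of the unwanted rows
dropped = corr.drop(corr[corr <= 0.2].index)

# Both variants select the same rows; unstack turns the Series back into a DataFrame
result = filtered.unstack()
print(result)
```

Boolean indexing is the shorter of the two, since no loop or explicit drop is needed.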

Andy