I wanted to filter out correlations of 1 and -1 from my dataframe, but I realised that some of the correlations are slightly bigger than 1 or lower than -1.

I couldn't find the real reason behind that. Here is a subset of the dataframe.

ModelID      RPL23A         ST13
ACH-001196  -4.384573196    0.025759764
ACH-000054  -4.384573196    0.025759764
ACH-001050  -4.384573196    0.025759764
ACH-000505  -4.81558301     0.44097594
ACH-001794  -4.384573196    0.025759764

Example code:

import pandas as pd

df = pd.read_csv('test.csv',index_col=0)
corr = df.corr()

If I want to select the correlations equal to -1, it does not work:

corr[corr == -1.0]

When I checked why, it turned out the values are not exactly -1:

corr.stack().reset_index().astype(str)


tyasird
  • Floating point math is imprecise: see [this answer](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) for more details. – Seon Aug 09 '23 at 14:12
  • @not_speshal I don't understand why you closed the topic. I know that floating point math is imprecise, but my point is that the correlation function should round its result: you calculate a correlation with pandas and then you cannot filter on it? – tyasird Aug 15 '23 at 10:04
  • It's still just a numerical precision issue. See [this](https://github.com/pandas-dev/pandas/issues/35135). – not_speshal Aug 15 '23 at 13:15

1 Answer

Instead of looking only for correlations equal to exactly -1, you could select any that are less than or equal to -1. For example:

corr[[column for column in corr.columns if (corr[column] <= -1).any()]]

This keeps the columns where a correlation comes out as something like -1.00007 (most likely due to floating-point rounding error). You can also apply the same idea with a tolerance, selecting every value within a small threshold of -1 (or 1), which additionally catches values that land just above -1.
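
The tolerance idea can be sketched as follows. This is only an illustration under some assumptions: the dataframe is rebuilt inline from the rows in the question instead of being read from test.csv, and the tolerance `tol = 1e-9` is an arbitrary value chosen for the example, not something prescribed by pandas.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (built inline instead of reading test.csv).
df = pd.DataFrame(
    {
        "RPL23A": [-4.384573196, -4.384573196, -4.384573196, -4.81558301, -4.384573196],
        "ST13": [0.025759764, 0.025759764, 0.025759764, 0.44097594, 0.025759764],
    },
    index=["ACH-001196", "ACH-000054", "ACH-001050", "ACH-000505", "ACH-001794"],
)
corr = df.corr()

# Assumed absolute tolerance; adjust it to the precision you care about.
tol = 1e-9

# Boolean masks of entries that are within tol of -1 or of 1.
near_minus_one = np.isclose(corr, -1.0, rtol=0.0, atol=tol)
near_plus_one = np.isclose(corr, 1.0, rtol=0.0, atol=tol)

# Keep only the (near-)perfect correlations; everything else becomes NaN.
perfect = corr.where(near_minus_one | near_plus_one)

# Alternatively, force every value back into the valid [-1, 1] range.
clipped = corr.clip(lower=-1.0, upper=1.0)
```

Inverting the mask (`corr.where(~(near_minus_one | near_plus_one))`) drops the near-perfect correlations instead, which is what the question was originally after. Clipping is the simpler option if the values only need to be valid correlations again, but an exact `== -1.0` comparison on the clipped result can still miss values that sit just above -1.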