Normalise numpy array / occurrence matrix "across the diagonal"

Question

I'm trying to normalise a co-occurrence matrix (I suppose that's what it's called?). I have the following data sample coming in from a CSV file:

import pandas as pd

df = pd.DataFrame({'A':[1,1,1,0,1,1,1,1],
                    'B':[1,0,1,0,1,1,1,1],
                    'C':[0,1,0,1,1,0,1,1],
                    'D':[1,1,1,1,0,1,1,1],
                    'E':[0,1,1,1,1,1,1,0]})

... and I have used the following approach to create this matrix (from "Constructing a co-occurrence matrix in python pandas"):

df_asint = df.astype(int)
coocc = df_asint.T.dot(df_asint)
print(coocc)

Output:

[4975 rows x 5 columns]
   A  B  C  D  E
A  7  6  4  6  5
B  6  6  3  5  4
C  4  3  5  4  4
D  6  5  4  7  5
E  5  4  4  5  6

Now the problem: I'm trying to normalise these values by the diagonal. I have solved it in Excel, as you can see in the screenshot.

Any thoughts on how to do this in pandas?

score 2 · Accepted Answer · answered Nov 04 '21 at 17:05

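Use numpy's np.diag to extract the diagonal, then divide by it. With the default axis, divide scales each column by its own diagonal entry: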

import numpy as np

>>> coocc.divide(np.diag(coocc))

          A         B    C         D         E
A  1.000000  1.000000  0.8  0.857143  0.833333
B  0.857143  1.000000  0.6  0.714286  0.666667
C  0.571429  0.500000  1.0  0.571429  0.666667
D  0.857143  0.833333  0.8  1.000000  0.833333
E  0.714286  0.666667  0.8  0.714286  1.000000
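To make the direction of the division explicit, here is a minimal sketch that rebuilds coocc from the question's sample data and compares the two possible normalisations. The names diag, by_column and by_row are only illustrative; which direction matches the Excel result is an assumption to check against your own sheet.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1],
                   'B': [1, 0, 1, 0, 1, 1, 1, 1],
                   'C': [0, 1, 0, 1, 1, 0, 1, 1],
                   'D': [1, 1, 1, 1, 0, 1, 1, 1],
                   'E': [0, 1, 1, 1, 1, 1, 1, 0]})
coocc = df.T.dot(df)

# Diagonal of the co-occurrence matrix: how often each label occurs at all.
diag = np.diag(coocc.to_numpy())

# Default axis='columns': column j is divided by diag[j] (the output shown above).
by_column = coocc.divide(diag)

# axis='index': row i is divided by diag[i] instead.
by_row = coocc.divide(diag, axis='index')
```

Because coocc is symmetric, by_row is simply the transpose of by_column, so the choice only decides whether each ratio reads down a column or along a row.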

If you want to force the values above the diagonal to zero, you can use np.tril and wrap the result back into a DataFrame:

>>> pd.DataFrame(np.tril(coocc.divide(np.diag(coocc))), columns=coocc.columns, index=coocc.index)

          A         B    C         D    E
A  1.000000  0.000000  0.0  0.000000  0.0
B  0.857143  1.000000  0.0  0.000000  0.0
C  0.571429  0.500000  1.0  0.000000  0.0
D  0.857143  0.833333  0.8  1.000000  0.0
E  0.714286  0.666667  0.8  0.714286  1.0
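As a possible alternative, the same masking can be done without rebuilding the DataFrame by passing a boolean lower-triangle mask to DataFrame.where. This is only a sketch: coocc is typed in by hand from the question's output, and the names normalised, lower and lower_only are illustrative.

```python
import numpy as np
import pandas as pd

# Co-occurrence matrix as printed in the question.
labels = list('ABCDE')
coocc = pd.DataFrame(
    [[7, 6, 4, 6, 5],
     [6, 6, 3, 5, 4],
     [4, 3, 5, 4, 4],
     [6, 5, 4, 7, 5],
     [5, 4, 4, 5, 6]],
    index=labels, columns=labels)

# Divide each column by its diagonal entry.
normalised = coocc.divide(np.diag(coocc.to_numpy()))

# Boolean mask that is True on and below the diagonal.
lower = np.tril(np.ones(coocc.shape, dtype=bool))

# where() keeps entries where the mask is True and writes 0.0 elsewhere,
# so the index and column labels are preserved automatically.
lower_only = normalised.where(lower, 0.0)
```

Replacing 0.0 with np.nan in the last line would leave the upper triangle empty rather than zero, which may be closer to a half-filled Excel table.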
