How do I create a co-occurrence matrix in Python?

Question

I have a dataframe with N columns. Each element in the dataframe is in the range 0 to N-1.

For example, my dataframe can be something like (N=3):

    A   B   C
0   0   2   0
1   1   0   1
2   2   2   0
3   2   0   0
4   0   0   0

I want to create a co-occurrence matrix (please correct me if there is a different standard name for that) of size N x N, in which each element ij contains the number of times that columns i and j assume the same value.

    A   B   C
A   x   2   3
B   2   x   2
C   3   2   x

Where, for example, matrix[0, 1] means that A and B assume the same value 2 times. I don't care about the value on the diagonal.

What is the smartest way to do that?

Does [constructing-a-co-occurrence-matrix-in-python-pandas](https://stackoverflow.com/questions/20574257/constructing-a-co-occurrence-matrix-in-python-pandas) answer your question? I'm not familiar with this, so I'm unsure whether to close it as a dupe, but let me know if it does. — Umar.H, May 14 '21 at 16:04

score 2 · Accepted Answer · answered May 14 '21 at 16:23

`DataFrame.corr`

We can pass a custom callable to `DataFrame.corr` in place of a named correlation method. The callable takes two 1D NumPy arrays as its input arguments and returns the number of positions at which the elements of the two arrays are equal:

df.corr(method=lambda x, y: (x==y).sum())

     A    B    C
A  1.0  2.0  3.0
B  2.0  1.0  2.0
C  3.0  2.0  1.0
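
For anyone who wants to run the one-liner end to end, here is a minimal self-contained sketch using the sample data from the question. The helper name `count_matches` and the final cast to `int` are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Sample data from the question (N = 3 columns).
df = pd.DataFrame({
    "A": [0, 1, 2, 2, 0],
    "B": [2, 0, 2, 0, 0],
    "C": [0, 1, 0, 0, 0],
})


def count_matches(x, y):
    """Count the positions at which two 1-D arrays hold the same value."""
    return (x == y).sum()


# corr calls count_matches once per pair of distinct columns.
cooc = df.corr(method=count_matches)

# corr returns floats and fixes the diagonal at 1 itself (the callable
# is not used there); cast to int purely for readability.
cooc = cooc.astype(int)
print(cooc)
```

Because `corr` fills the diagonal with 1 regardless of the callable, this approach only suits cases like the question's, where the diagonal is ignored.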

score 1 · Answer 2 · answered May 14 '21 at 16:22

Let's try broadcasting the transposed values against themselves and summing over axis 2:

import pandas as pd

df = pd.DataFrame({
    'A': {0: 0, 1: 1, 2: 2, 3: 2, 4: 0},
    'B': {0: 2, 1: 0, 2: 2, 3: 0, 4: 0},
    'C': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}
})

vals = df.T.values
e = (vals[:, None] == vals).sum(axis=2)

new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)
print(new_df)

The intermediate array e:

[[5 2 3]
 [2 5 2]
 [3 2 5]]

Wrapping e in a dataframe labelled with the column names (the last two lines of the code above) gives new_df:

   A  B  C
A  5  2  3
B  2  5  2
C  3  2  5
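
To make the broadcasting step easier to follow, here is a small sketch that prints the intermediate shape. The name `pairwise_equal` and the optional `np.fill_diagonal` call are additions for illustration, assuming the diagonal can be discarded as the question states.

```python
import numpy as np

# Columns of the question's dataframe, one row per column: shape (3, 5).
vals = np.array([
    [0, 1, 2, 2, 0],  # A
    [2, 0, 2, 0, 0],  # B
    [0, 1, 0, 0, 0],  # C
])

# vals[:, None] has shape (3, 1, 5); comparing it with the (3, 5) array
# broadcasts to (3, 3, 5), one boolean vector per pair of columns.
pairwise_equal = vals[:, None] == vals
print(pairwise_equal.shape)

# Summing over the last axis counts the matching rows for each pair.
e = pairwise_equal.sum(axis=2)

# Optional: zero out the diagonal, which the question does not need.
np.fill_diagonal(e, 0)
print(e)
```

The intermediate boolean array holds N x N x (number of rows) elements, so memory use grows quickly for wide dataframes.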

score 0 · Answer 3 · answered May 14 '21 at 16:19

I don't know about the smartest way but I think this works:

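import numpy as np

# One row per observation, one column per variable (A, B, C).
m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = m.shape[1]

ans = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        # Rows that match are the rows whose difference is zero.
        ans[i, j] = len(m) - np.count_nonzero(m[:, i] - m[:, j])

# Mirror the upper triangle; the diagonal stays 0.
print(ans + ans.T)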
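
The counting trick inside the loop may not be obvious at first glance, so here is a short sketch checking it against a direct equality count for columns A and B of the question. The variable names `via_difference` and `via_equality` are made up for this illustration.

```python
import numpy as np

a = np.array([0, 1, 2, 2, 0])  # column A from the question
b = np.array([2, 0, 2, 0, 0])  # column B from the question

# A non-zero difference marks a row where the two values differ, so
# subtracting that count from the number of rows leaves the matches.
via_difference = len(a) - np.count_nonzero(a - b)

# Direct equality count for comparison.
via_equality = int(np.sum(a == b))

print(via_difference, via_equality)
```

Both expressions give the same count for integer data like this; the equality form is arguably easier to read.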
