2

I have a dataframe with N columns. Each element in the dataframe is in the range 0 to N-1.

For example, my dataframe could look like this (N=3):

    A   B   C
0   0   2   0
1   1   0   1
2   2   2   0
3   2   0   0
4   0   0   0

I want to create a co-occurrence matrix (please correct me if there is a different standard name for that) of size N x N, in which each element ij contains the number of rows where columns i and j hold the same value.

    A   B   C
A   x   2   3
B   2   x   2
C   3   2   x

Where, for example, matrix[0, 1] means that A and B assume the same value 2 times. I don't care about the value on the diagonal.

What is the smartest way to do that?

ImAUser
  • Does [constructing-a-co-occurrence-matrix-in-python-pandas](https://stackoverflow.com/questions/20574257/constructing-a-co-occurrence-matrix-in-python-pandas) answer your question? I'm not familiar with this, so I'm unsure whether to close it as a duplicate, but let me know if it does. – Umar.H May 14 '21 at 16:04

3 Answers

2

DataFrame.corr

We can pass a custom callable as the method argument of DataFrame.corr. The callable takes two 1D numpy arrays as its input arguments and returns a single number; here it returns the count of positions at which the two arrays are equal. Note that corr always puts 1.0 on the diagonal, regardless of what the callable returns.

df.corr(method=lambda x, y: (x==y).sum())

     A    B    C
A  1.0  2.0  3.0
B  2.0  1.0  2.0
C  3.0  2.0  1.0
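
For a self-contained run, here is a minimal sketch of the same idea on the sample data from the question. The cast to int and the zeroing of the diagonal with np.fill_diagonal are additions, not part of the answer above; they assume that integer counts are wanted and that the diagonal can be ignored, as the question states.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'A': [0, 1, 2, 2, 0],
    'B': [2, 0, 2, 0, 0],
    'C': [0, 1, 0, 0, 0],
})

# The callable receives two 1D arrays and returns how often they match.
counts = df.corr(method=lambda x, y: (x == y).sum())

# corr returns floats with 1.0 on the diagonal, so cast the counts to int
# and zero the diagonal (assumption: the diagonal is not needed).
values = counts.to_numpy().astype(int)
np.fill_diagonal(values, 0)

result = pd.DataFrame(values, index=df.columns, columns=df.columns)
print(result)
```

The off-diagonal entries should match the table in the question: 2 for A/B, 3 for A/C and 2 for B/C.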
Shubham Sharma
1

Let's try broadcasting the transposed values against themselves and summing over axis 2:

import pandas as pd

df = pd.DataFrame({
    'A': {0: 0, 1: 1, 2: 2, 3: 2, 4: 0},
    'B': {0: 2, 1: 0, 2: 2, 3: 0, 4: 0},
    'C': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}
})

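# One row per original column: shape (N, number of rows)
vals = df.T.values
# (N, 1, rows) == (N, rows) broadcasts to (N, N, rows); summing axis 2 counts matches
e = (vals[:, None] == vals).sum(axis=2)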

new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)
print(new_df)

e:

[[5 2 3]
 [2 5 2]
 [3 2 5]]

Turn back into a dataframe:

new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)

new_df:

   A  B  C
A  5  2  3
B  2  5  2
C  3  2  5
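
To make the broadcasting step easier to follow, here is a small sketch that prints the intermediate shape. The variable name pairwise and the final np.fill_diagonal call are illustrative additions, assuming the diagonal is not of interest.

```python
import numpy as np

# Sample data from the question, transposed: one row per column A, B, C.
vals = np.array([
    [0, 1, 2, 2, 0],  # A
    [2, 0, 2, 0, 0],  # B
    [0, 1, 0, 0, 0],  # C
])

# vals[:, None] has shape (3, 1, 5); comparing it with vals of shape (3, 5)
# broadcasts to a (3, 3, 5) boolean array of pairwise comparisons.
pairwise = vals[:, None] == vals
print(pairwise.shape)  # (3, 3, 5)

# Summing over the last axis counts the matching rows for each column pair.
e = pairwise.sum(axis=2)

# Assumption: the diagonal is irrelevant, so set it to zero.
np.fill_diagonal(e, 0)
print(e)
```

Keep in mind that the intermediate boolean array holds N x N x (number of rows) entries, so this approach trades memory for speed on large inputs.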
Henry Ecker
0

I don't know about the smartest way but I think this works:

import numpy as np

m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = 3

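ans = np.zeros((n, n))
for i in range(n):
    for j in range(i+1, n):
        # Rows where columns i and j are equal = total rows - rows where they differ
        ans[i, j] = len(m) - np.count_nonzero(m[:, i] - m[:, j])

# Mirror the upper triangle to get a symmetric matrix; the diagonal stays 0
print(ans + ans.T)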
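
As a possible variant of the loop above, the sketch below compares the columns directly with == and returns integer counts. The function name same_value_counts is hypothetical, and the use of itertools.combinations is a stylistic choice rather than part of the original answer.

```python
from itertools import combinations

import numpy as np


def same_value_counts(m):
    """Count, for each pair of columns of m, the rows where both are equal."""
    n = m.shape[1]
    ans = np.zeros((n, n), dtype=int)
    # Visit each unordered pair of columns once and fill both triangles.
    for i, j in combinations(range(n), 2):
        ans[i, j] = ans[j, i] = np.count_nonzero(m[:, i] == m[:, j])
    return ans


# Sample data from the question, one row per dataframe row.
m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
print(same_value_counts(m))
```

Comparing with == avoids relying on subtraction, so the same function would also work for non-numeric values such as strings.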
aaronn