Let's assume the co-occurrence matrix is given as a list of lists:
com = [[5, 1, 0],
       [1, 3, 2],
       [0, 2, 3]]
n_elem = len(com)
The Jaccard similarity of two sets A and B is given by |A ∩ B| / |A ∪ B|. The co-occurrence matrix gives the values of |A| and |B| on its diagonal and |A ∩ B| off the diagonal. The value of |A ∪ B| is simply |A| + |B| - |A ∩ B|, from which we can find the Jaccard index.
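As a quick sanity check of the formula, here is the arithmetic for a single pair, using elements 0 and 1 of the example matrix. The variable names are only for illustration:

```python
# Worked example for elements 0 and 1 of the example co-occurrence matrix.
com = [[5, 1, 0],
       [1, 3, 2],
       [0, 2, 3]]

size_a = com[0][0]                      # |A| = 5
size_b = com[1][1]                      # |B| = 3
intersection = com[0][1]                # |A ∩ B| = 1
union = size_a + size_b - intersection  # |A ∪ B| = 5 + 3 - 1 = 7

jaccard = intersection / union          # 1 / 7
print(jaccard)
```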
First, let's create a list of lists containing ones that is the same size as com. The default value is 1 because the similarity index of a set with itself is 1, and we will not calculate these elements:
similarity = [[1 for _ in row] for row in com]
Now, we can loop over each pair of values in com and calculate the similarities. The inner loop starts at i+1 because similarity[i][j] is identical to similarity[j][i], so we only need to calculate the upper triangle of the matrix:
for i in range(n_elem):
    a = com[i][i]  # |A|
    for j in range(i+1, n_elem):
        b = com[j][j]      # |B|
        aib = com[i][j]    # |A ∩ B|
        aub = a + b - aib  # |A ∪ B|
        # Set both off-diagonal elements simultaneously
        similarity[i][j] = similarity[j][i] = aib / aub
This leaves us with the following similarity matrix:
[[1 , 0.14285714285714285, 0.0],
[0.14285714285714285, 1 , 0.5],
[0.0 , 0.5 , 1]]
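If you want to reuse this, the loop can be wrapped in a function. The sketch below is one possible way to do that; the function name jaccard_from_cooccurrence is made up for this example, and the handling of an empty union (returning 0.0 when neither element ever occurs) is an assumed convention, since the index is undefined in that case:

```python
def jaccard_from_cooccurrence(com):
    """Return the Jaccard similarity matrix for a square co-occurrence matrix.

    Assumes com[i][i] is the number of occurrences of element i and
    com[i][j] is the number of co-occurrences of elements i and j.
    """
    n_elem = len(com)
    # Diagonal defaults to 1.0: a set is identical to itself.
    similarity = [[1.0] * n_elem for _ in range(n_elem)]
    for i in range(n_elem):
        a = com[i][i]  # |A|
        for j in range(i + 1, n_elem):
            b = com[j][j]      # |B|
            aib = com[i][j]    # |A ∩ B|
            aub = a + b - aib  # |A ∪ B|
            # Assumed convention: an empty union gives similarity 0.0
            # instead of raising ZeroDivisionError.
            value = aib / aub if aub else 0.0
            similarity[i][j] = similarity[j][i] = value
    return similarity


result = jaccard_from_cooccurrence([[5, 1, 0],
                                    [1, 3, 2],
                                    [0, 2, 3]])
```

Without the guard, a matrix containing two elements that never occur would fail with a division by zero in the plain loop.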
Now, if your co-occurrence matrix is a numpy array (or you're open to using numpy), you can speed up this computation by outsourcing the loops to numpy's C backend.
import numpy as np
com_arr = np.array([[5, 1, 0],
                    [1, 3, 2],
                    [0, 2, 3]])
n_elem = com_arr.shape[0]  # number of rows, i.e. number of elements
First, we can get the occurrence of each element using the diagonal of the matrix:
occ = np.diag(com_arr) # array([5, 3, 3])
Next, create the matrix of |A ∪ B|. Remember that |A ∩ B| is already specified by com_arr:
aub = occ[:, None] + occ[None, :] - com_arr
Since occ is a 1-d array, adding a None index creates a 2-d array: occ[:, None] is a column vector of shape (3, 1) and occ[None, :] is a row vector of shape (1, 3). When adding a row vector to a column vector, numpy automatically broadcasts the dimensions so that you end up with a (in this case) square matrix of shape (3, 3). Now, aub looks like this:
array([[5, 7, 8],
[7, 3, 4],
[8, 4, 3]])
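To see the broadcasting step in isolation, here is a small sketch that only adds the column and row views of occ, without subtracting the intersection. The names col, row and pair_sums are just for illustration:

```python
import numpy as np

occ = np.array([5, 3, 3])

col = occ[:, None]  # column vector, shape (3, 1)
row = occ[None, :]  # row vector, shape (1, 3)

# Broadcasting expands both operands to shape (3, 3);
# entry [i, j] is occ[i] + occ[j], i.e. |A| + |B| for each pair.
pair_sums = col + row

print(col.shape, row.shape, pair_sums.shape)
print(pair_sums)
```

Subtracting com_arr from pair_sums gives the same union matrix as the one-line expression.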
Finally, divide the intersection by the union:
similarity = com_arr / aub
Et voilà, we have the same values as before:
array([[1. , 0.14285714, 0. ],
[0.14285714, 1. , 0.5 ],
[0. , 0.5 , 1. ]])
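One caveat: if some pair of elements never occurs at all, the corresponding entry of aub is 0 and the plain division produces nan along with a runtime warning. A possible way around that is sketched below, assuming you want 0 for such pairs and 1 on the diagonal, as in the loop version. The function name jaccard_matrix is made up for this example:

```python
import numpy as np

def jaccard_matrix(com_arr):
    """Jaccard similarity from a square co-occurrence matrix (sketch)."""
    com_arr = np.asarray(com_arr, dtype=float)
    occ = np.diag(com_arr)                        # |A| for each element
    aub = occ[:, None] + occ[None, :] - com_arr   # |A ∪ B| for each pair

    # Divide only where the union is non-empty; other entries stay 0.
    similarity = np.zeros_like(com_arr)
    np.divide(com_arr, aub, out=similarity, where=aub != 0)

    # Assumed convention: a set is identical to itself, even if empty.
    np.fill_diagonal(similarity, 1.0)
    return similarity


sim = jaccard_matrix([[5, 1, 0],
                      [1, 3, 2],
                      [0, 2, 3]])
```

For the example matrix this gives the same result as com_arr / aub, since none of the unions are empty.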