
I have a co-occurrence matrix in PySpark for co-occurrences of certain keywords A, B, C:

    A   B   C
A   5   1   0
B   1   3   2
C   0   2   3

How can I calculate the Jaccard similarity from this matrix in Python for all keywords? Is there a library available to do this, or should I simply compute the similarity using the Jaccard similarity formula?

    Why do want to use PySpark for this task instead of pure Python (or [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html))? Is your matrix so large that you have to use some kind of parallelism? – werner Aug 23 '23 at 19:32
  • I see you added the [tag:pyspark] tag after I answered. While I am not familiar enough with pyspark to use it in my answer, I expect the approach is the same as what I show with numpy -- Get the values of the diagonal elements, tile it vertically and horizontally, add the two matrices, and then subtract the original matrix from it. This gives you the values of `|A ∪ B|`, then you need to divide the original matrix by this matrix for the Jaccard similarity. – Pranav Hosangadi Aug 25 '23 at 17:50
  • Did you have a look at [this](https://stackoverflow.com/questions/52923110/spark-python-how-to-calculate-jaccard-similarity-between-each-line-within-an-rd)? – Let's try Aug 29 '23 at 18:50

1 Answer


Let's assume the co-occurrence matrix is given as a list of lists:

com = [[5, 1, 0], 
       [1, 3, 2],
       [0, 2, 3]]
n_elem = len(com)

The Jaccard similarity of two sets A and B is given by |A ∩ B| / |A ∪ B|. The co-occurrence matrix gives the values of |A|, |B|, and |A ∩ B|. The value of |A ∪ B| is simply |A| + |B| - |A ∩ B|, from which we can compute the Jaccard index.
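For example, taking keywords A and B from the matrix above: |A| = 5, |B| = 3, and |A ∩ B| = 1, so |A ∪ B| = 5 + 3 - 1 = 7 and the similarity is 1/7 ≈ 0.143. As a quick sanity check:

```python
a, b, aib = 5, 3, 1   # |A|, |B|, |A ∩ B| read off the matrix
aub = a + b - aib     # |A ∪ B| = 7
jaccard = aib / aub   # 1/7 ≈ 0.1428...
```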

First, let's create a list of lists of ones that is the same size as com. The default value is 1 because the similarity of a set with itself is 1, so we will not need to calculate the diagonal elements:

similarity = [[1 for _ in row] for row in com]

Now, we can loop over each pair of values in com and calculate the similarities. The inner loop starts at i+1 because similarity[i][j] is identical to similarity[j][i], so we only need to calculate the upper triangle of the matrix:

for i in range(n_elem):
    a = com[i][i]               # |A|
    for j in range(i+1, n_elem):
        b = com[j][j]           # |B|
        aib = com[i][j]         # |A ∩ B|
        aub = a + b - aib       # |A ∪ B|
        # Set both off-diagonal elements simultaneously
        similarity[i][j] = similarity[j][i] = aib / aub

This leaves us with the following similarity matrix:

[[1                  , 0.14285714285714285, 0.0], 
 [0.14285714285714285, 1                  , 0.5], 
 [0.0                , 0.5                , 1]]

Now, if your co-occurrence matrix is a numpy array (or you're open to using numpy), you can speed up this computation by outsourcing the loops to numpy's C backend.

import numpy as np

com_arr = np.array([[5, 1, 0],
                    [1, 3, 2],
                    [0, 2, 3]])
n_elem = com_arr.shape[0]  # number of keywords (rows), not .size, which counts all elements

First, we can get the occurrence of each element using the diagonal of the matrix:

occ = np.diag(com_arr) # array([5, 3, 3])

Next, create the matrix of |A ∪ B|. Remember that |A ∩ B| is already specified by com_arr:

aub = occ[:, None] + occ[None, :] - com_arr

Since occ is a 1-d array, adding a None index will create a 2-d array of one column (a column vector of shape (3, 1)) and one row (a row vector of shape (1, 3)) respectively. When adding a row vector to a column vector, numpy automatically broadcasts the dimensions so that you end up with a (in this case) square matrix of shape (3, 3). Now, aub looks like this:

array([[5, 7, 8],
       [7, 3, 4],
       [8, 4, 3]])
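To see the broadcasting step in isolation, here is a small sketch with the same occ values, before the co-occurrence matrix is subtracted:

```python
import numpy as np

occ = np.array([5, 3, 3])
col = occ[:, None]        # column vector, shape (3, 1)
row = occ[None, :]        # row vector, shape (1, 3)
pairwise_sum = col + row  # broadcasts to shape (3, 3): |A| + |B| for every pair
```

Each entry `pairwise_sum[i][j]` is `occ[i] + occ[j]`; subtracting `com_arr` then yields `aub` above.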

Finally, divide the intersection by the union:

similarity = com_arr / aub

et voila, we have the same values as before:

array([[1.        , 0.14285714, 0.        ],
       [0.14285714, 1.        , 0.5       ],
       [0.        , 0.5       , 1.        ]])
Pranav Hosangadi