
I have a batch of vectors of shape (bs, m, n) (i.e., bs matrices of dimensions m×n, each holding m vectors of length n). For each batch, I would like to calculate the Jaccard similarity of the first vector with the remaining (m-1) vectors.

Example:

a = [
    [[3, 8, 6, 8, 7],
    [9, 7, 4, 8, 1],
    [7, 8, 8, 5, 7],
    [3, 9, 9, 4, 4]],

    [[7, 3, 8, 1, 7],
    [3, 0, 3, 4, 2],
    [9, 1, 6, 1, 6],
    [2, 7, 0, 6, 6]]
]

Find the pairwise Jaccard similarity between a[:,0,:] and a[:,1:,:], i.e.,

[3, 8, 6, 8, 7] with each of [[9, 7, 4, 8, 1], [7, 8, 8, 5, 7], [3, 9, 9, 4, 4]] (3 scores)
and
[7, 3, 8, 1, 7] with each of [[3, 0, 3, 4, 2], [9, 1, 6, 1, 6], [2, 7, 0, 6, 6]] (3 scores)

Here's the Jaccard function I have tried

def js(la1, la2):
    combined = torch.cat((la1, la2))
    union, counts = combined.unique(return_counts=True)
    # values that appear more than once in the concatenation
    intersection = union[counts > 1]
    return torch.numel(intersection) / torch.numel(union)

While this works with unequal-sized tensors, the problem with this approach is that the number of unique values in each combination (pair of tensors) might differ, and since PyTorch doesn't support jagged tensors, I'm unable to process batches of vectors at once.
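For instance, comparing the first row of a with the next two rows gives unions of different sizes (a small illustration of the issue, using the example tensor above):

import torch

first = torch.tensor([3, 8, 6, 8, 7])
print(torch.cat((first, torch.tensor([9, 7, 4, 8, 1]))).unique().numel())  # 7 uniques
print(torch.cat((first, torch.tensor([7, 8, 8, 5, 7]))).unique().numel())  # 5 uniques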

If I'm not able to express the problem with the expected clarity, do let me know. Any help in this regard would be greatly appreciated.

EDIT: Here's the flow, achieved by iterating over the 1st and 2nd dimensions. I wish to have a vectorised version of the below code for batch processing:

bs = 2
m = 4
n = 5
a = torch.randint(0, 10, (bs, m, n))
print(f"Array is: \n{a}")

for bs_idx in range(bs):
    first = a[bs_idx,0,:]
    for row in range(1, m):
        second = a[bs_idx,row,:]
        idx = js(first, second)
        print(f'comparing {first} and {second}: {idx}')
helloworld

  • Can you provide a minimal example with the expected result? – Ivan Jun 17 '22 at 10:09
  • Sure. For the above example of tensor a, here's what I'm expecting: [[0.286, 0.4, 0.667], [0.286, 0.5, 0.286]]. I've updated the post to include a for-loop version that achieves what I want. – helloworld Jun 17 '22 at 10:21
  • Can you provide an example of when you are unable to process batches of vectors at once? – A.Mounir Jun 17 '22 at 10:58
  • For a tensor [[[8, 7, 2, 9, 3], [3, 5, 5, 0, 7], [7, 7, 2, 4, 4]], [[8, 5, 7, 4, 3], [5, 9, 4, 1, 6], [2, 8, 8, 3, 9]]], the problem arises when I try to get unique values. The union is [[[0, 2, 3, 3, 5, 5, 7, 7, 8, 9], [4, 2, 3, 7, 2, 7, 7, 4, 8, 9]], [[1, 7, 3, 5, 4, 9, 5, 6, 8, 4], [3, 7, 3, 2, 8, 8, 5, 9, 8, 4]]] with counts of [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. – helloworld Jun 17 '22 at 11:28

3 Answers


I don't know how you could achieve this in pytorch, since AFAIK pytorch doesn't support set operations on tensors. In your js() implementation, the union calculation should work, but intersection = union[counts > 1] doesn't give you the right result if one of the tensors contains duplicated values (see the short example below). Numpy, on the other hand, has built-in support with union1d and intersect1d.
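A quick illustration of the duplicate problem (my own toy example, not from the question):

import torch

la1 = torch.tensor([1, 1, 2])
la2 = torch.tensor([3, 4, 5])  # shares no values with la1
combined = torch.cat((la1, la2))
union, counts = combined.unique(return_counts=True)
print(union[counts > 1])  # tensor([1]) -- but the true intersection is empty

You can use numpy vectorization to calculate pairwise Jaccard indices without using for-loops: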

import numpy as np

def num_intersection(vec1: np.ndarray, vec2: np.ndarray) -> int:
    return np.intersect1d(vec1, vec2, assume_unique=False).size

def num_union(vec1: np.ndarray, vec2: np.ndarray) -> int:
    return np.union1d(vec1, vec2).size

def jaccard1d(vec1: np.ndarray, vec2: np.ndarray) -> float:
    assert vec1.ndim == vec2.ndim == 1 and vec1.shape[0] == vec2.shape[0], 'vec1 and vec2 must be 1D arrays of equal length'
    return num_intersection(vec1, vec2) / num_union(vec1, vec2)
jaccard2d = np.vectorize(jaccard1d, signature='(m),(n)->()')

def jaccard(vecs1: np.ndarray, vecs2: np.ndarray) -> np.ndarray:
    """
    Return intersection-over-union (Jaccard index) between two sets of vectors.
    Both sets of vectors are expected to be flattened to 2D, where dim 0 is the batch
    dimension and dim 1 contains the flattened vectors of length V (the Jaccard
    index of an n-dimensional vector equals that of its flattened 1D vector).
    Args:
        vecs1 (ndarray[N, V]): first set of vectors
        vecs2 (ndarray[M, V]): second set of vectors
    Returns:
        ndarray[N, M]: the NxM matrix containing the pairwise jaccard indices for every vector in vecs1 and vecs2
    """
    assert vecs1.ndim == vecs2.ndim == 2 and vecs1.shape[1] == vecs2.shape[1], 'vecs1 and vecs2 must be 2D arrays with equal length in axis 1'
    return jaccard2d(vecs1, vecs2)

This is of course suboptimal because the code doesn't run on the GPU. If I run the jaccard function with vecs1 of shape (1, 10) and vecs2 of shape (10_000, 10) I get a mean loop time of 200 ms ± 1.34 ms on my machine, which should probably be fast enough for most use cases. And conversion between pytorch and numpy arrays is very cheap.

To apply this function to your problem with array a:

a = torch.tensor(a).numpy()  # just to demonstrate
ious = [jaccard(batch[:1, :], batch[1:, :]) for batch in a]
np.array(ious).squeeze()  # 2 batches with 3 scores each -> 2x3 matrix

# array([[0.28571429, 0.4       , 0.16666667],
#        [0.14285714, 0.16666667, 0.14285714]])

Use torch.from_numpy() on the result to get a pytorch tensor again if needed.
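For example, reusing the ious list from the previous snippet:

import numpy as np
import torch

ious_t = torch.from_numpy(np.array(ious).squeeze())  # pytorch tensor of shape (2, 3)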


Update:

If you need a pytorch version for calculating the Jaccard index, I partially implemented numpy's intersect1d in torch:

from torch import Tensor

def torch_intersect1d(t1: Tensor, t2: Tensor, assume_unique: bool = False) -> Tensor:
    if t1.ndim > 1:
        t1 = t1.flatten()
    if t2.ndim > 1:
        t2 = t2.flatten()
    if not assume_unique:
        t1 = t1.unique(sorted=True)
        t2 = t2.unique(sorted=True)
    # compare every element of t1 with every element of t2 (an m x n boolean matrix,
    # where m is numel(t1) and n is numel(t2)) and keep the t1 elements found in t2
    intersect = t1[(t1.view(-1, 1) == t2.view(1, -1)).any(dim=1)]
    if not assume_unique:
        intersect = intersect.sort().values
    return intersect

def torch_union1d(t1: Tensor, t2: Tensor) -> Tensor:
    return torch.cat((t1.flatten(), t2.flatten())).unique()

def torch_jaccard1d(t1: Tensor, t2: Tensor) -> float:
    return torch_intersect1d(t1, t2).numel() / torch_union1d(t1, t2).numel()

To vectorize the torch_jaccard1d function, you might want to look into torch.vmap, which lets you vectorize a function over an arbitrary batch dimension (similar to numpy's vectorize). The vmap function is a prototype feature and not yet available in the usual pytorch distributions, but you can get it using nightly builds of pytorch. I haven't tested it but this might work.
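Alternatively, since unique() returns tensors whose size depends on the data, it tends to resist batching. If your values are non-negative integers with a known upper bound (as in your example), you can avoid unique() altogether by encoding each row as a fixed-size presence mask and computing intersection and union sizes with broadcasting; this runs fully batched on the GPU. A minimal sketch (my own, untested beyond the example; batched_jaccard and num_values are names I made up):

import torch

def batched_jaccard(a: torch.Tensor, num_values: int) -> torch.Tensor:
    # a: integer tensor of shape (bs, m, n) with values in [0, num_values)
    bs, m, _ = a.shape
    # presence[b, i, v] is True iff value v occurs somewhere in row a[b, i]
    presence = torch.zeros(bs, m, num_values, dtype=torch.int8, device=a.device)
    presence.scatter_(2, a.long(), 1)   # mark the values occurring in each row
    presence = presence.bool()
    first, rest = presence[:, :1, :], presence[:, 1:, :]  # (bs, 1, V), (bs, m-1, V)
    intersection = (first & rest).sum(dim=2)              # |row0 ∩ rowi| per pair
    union = (first | rest).sum(dim=2)                     # |row0 ∪ rowi| per pair
    return intersection / union                           # shape (bs, m-1)

a = torch.tensor([[[3, 8, 6, 8, 7], [9, 7, 4, 8, 1], [7, 8, 8, 5, 7], [3, 9, 9, 4, 4]],
                  [[7, 3, 8, 1, 7], [3, 0, 3, 4, 2], [9, 1, 6, 1, 6], [2, 7, 0, 6, 6]]])
print(batched_jaccard(a, num_values=10))
# tensor([[0.2857, 0.4000, 0.1667],
#         [0.1429, 0.1667, 0.1429]])

Memory grows with num_values, so this only pays off when the value range is modest (it is 0-99 in your benchmark data).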

asdf

  • Thanks asdf for the elaborate approach. The main reason I'm so stuck on pytorch is so that I can use GPUs to calculate the similarity. – helloworld Jun 19 '22 at 05:38
  • You're welcome. I doubt that using the GPU would lead to a speedup if your batches/vectors are similar to the ones in your question. GPU performance usually only exceeds CPU performance for very large matrix calculations and might even be slower for small matrices. I'll update my post with a torch implementation of the 1D Jaccard calculation. – asdf Jun 19 '22 at 14:17

You can use numpy.unique, numpy.intersect1d, and numpy.union1d, adapting the function recommended here for computing Jaccard similarity in Python:

import torch
import numpy as np

def jaccard_similarity_numpy(list1, list2):
    s1 = np.unique(list1)
    s2 = np.unique(list2)
    return (len(np.intersect1d(s1,s2))) / len(np.union1d(s1,s2))

tns_a = torch.tensor([
    [[3, 8, 6, 8, 7],
    [9, 7, 4, 8, 1],
    [7, 8, 8, 5, 7],
    [3, 9, 9, 4, 4]],

    [[7, 3, 8, 1, 7],
    [3, 0, 3, 4, 2],
    [9, 1, 6, 1, 6],
    [2, 7, 0, 6, 6]]
])

tns_b = torch.empty(tns_a.shape[0], tns_a.shape[1] - 1)
for i in range(tns_a.shape[0]):
    tns_tmp = tns_a[i][0]
    for j, tns in enumerate(tns_a[i][1:]):
        tns_b[i][j] = jaccard_similarity_numpy(tns_tmp, tns)

print(tns_b)

Output:

tensor([[0.2857, 0.4000, 0.1667],
        [0.1429, 0.1667, 0.1429]])
I'mahdi

  • Thanks for the solution, @I'mahdi. But I'm really looking for a solution using pytorch (to utilise GPUs) and without for loops. – helloworld Jun 19 '22 at 05:58
  • @helloworld, you're welcome. So you want the tensor computation to run on the GPU? – I'mahdi Jun 19 '22 at 06:01
  • Yes @I'mahdi. I'll be computing the jaccard distance for a batch of 45000 vectors, each with dimensions of 110x12, so using GPUs to parallelise it could really help. – helloworld Jun 19 '22 at 13:24
  • @helloworld, I wrote another approach with `numba` in another answer and get `5 seconds` for data with shape `(45_000, 110, 12)`. Is this OK or do you want a faster approach? – I'mahdi Jun 19 '22 at 14:35

You don't tag numba, but since you want a fast approach for computing jaccard_similarity for data with shape (45_000, 110, 12), I highly recommend using numba with parallel=True. I get a run time of only 5 sec for random data with shape (45_000, 110, 12). (Run time measured on Colab.)

import numpy as np
import numba as nb
import torch

@nb.jit(nopython=True, parallel=True)
def jaccard_Imahdi_Numba(batch_tensor):
    tns_b = np.empty((batch_tensor.shape[0],batch_tensor.shape[1]-1))
    for i in nb.prange(batch_tensor.shape[0]):
        tns_tmp = batch_tensor[i][0]
        for j , tns in enumerate(batch_tensor[i][1:]):
            s1 = set(tns_tmp)
            s2 = set(tns)
            res = len(s1.intersection(s2)) / len(s1.union(s2))
            tns_b[i][j] = res
    return tns_b

large_tns = torch.tensor(np.random.randint(0,100, (45_000,110,12)))
%timeit jaccard_Imahdi_Numba(large_tns.numpy())
# 1 loop, best of 5: 5.37 s per loop

I wrote the benchmark below for 50_000 batches with shape (4, 5) -> (50_000, 4, 5). We get 116 ms with numba, while the other approaches take 8 sec and 9 sec. (Run times measured on Colab.)

import numpy as np
import numba as nb
import torch

def jaccard_asdf(batch_tensor):
    def num_intersection(vec1: np.ndarray, vec2: np.ndarray) -> int:
        return np.intersect1d(vec1, vec2, assume_unique=False).size

    def num_union(vec1: np.ndarray, vec2: np.ndarray) -> int:
        return np.union1d(vec1, vec2).size

    def jaccard1d(vec1: np.ndarray, vec2: np.ndarray) -> float:
        assert vec1.ndim == vec2.ndim == 1 and vec1.shape[0] == vec2.shape[0], 'vec1 and vec2 must be 1D arrays of equal length'
        return num_intersection(vec1, vec2) / num_union(vec1, vec2)
    jaccard2d = np.vectorize(jaccard1d, signature='(m),(n)->()')

    def jaccard(vecs1: np.ndarray, vecs2: np.ndarray) -> np.ndarray:
        assert vecs1.ndim == vecs2.ndim == 2 and vecs1.shape[1] == vecs2.shape[1], 'vecs1 and vecs2 must be 2D arrays with equal length in axis 1'
        return jaccard2d(vecs1, vecs2)

    a = torch.tensor(batch_tensor).numpy()  # just to demonstrate
    ious = [jaccard(batch[:1, :], batch[1:, :]) for batch in a]
    return np.array(ious).squeeze() 


def jaccard_Imahdi(batch_tensor):
    def jaccard_similarity_numpy(list1, list2):
        s1 = np.unique(list1)
        s2 = np.unique(list2)
        return (len(np.intersect1d(s1,s2))) / len(np.union1d(s1,s2))

    tns_b = np.empty((batch_tensor.shape[0],batch_tensor.shape[1]-1))
    for i in range(batch_tensor.shape[0]):
        tns_tmp = batch_tensor[i][0]
        for j , tns in enumerate(batch_tensor[i][1:]):
            tns_b[i][j] = (jaccard_similarity_numpy(tns_tmp, tns))
    return tns_b


@nb.jit(nopython=True, parallel=True)
def jaccard_Imahdi_Numba(batch_tensor):
    tns_b = np.empty((batch_tensor.shape[0],batch_tensor.shape[1]-1))
    for i in nb.prange(batch_tensor.shape[0]):
        tns_tmp = batch_tensor[i][0]
        for j , tns in enumerate(batch_tensor[i][1:]):
            s1 = set(tns_tmp)
            s2 = set(tns)
            res = len(s1.intersection(s2)) / len(s1.union(s2))
            tns_b[i][j] = res
    return tns_b

small_tns = torch.tensor([
    [[3, 8, 6, 8, 7],
    [9, 7, 4, 8, 1],
    [7, 8, 8, 5, 7],
    [3, 9, 9, 4, 4]],

    [[7, 3, 8, 1, 7],
    [3, 0, 3, 4, 2],
    [9, 1, 6, 1, 6],
    [2, 7, 0, 6, 6]]
])

print(f'''output jaccard_asdf: \n{
    jaccard_asdf(small_tns)
    }''')

print(f'''output jaccard_Imahdi: \n{
    jaccard_Imahdi(small_tns)
    }''')

print(f'''output jaccard_Imahdi_Numba: \n{
    jaccard_Imahdi_Numba(small_tns.numpy())
    }''')

large_tns = torch.tensor(np.random.randint(0,100, (50_000,4,5)))

%timeit jaccard_Imahdi(large_tns)
# 1 loop, best of 5: 8.32 s per loop

%timeit jaccard_asdf(large_tns)
# 1 loop, best of 5: 9.92 s per loop

%timeit jaccard_Imahdi_Numba(large_tns.numpy())
# 1 loop, best of 5: 116 ms per loop

Output:

output jaccard_asdf: 
[[0.28571429 0.4        0.16666667]
 [0.14285714 0.16666667 0.14285714]]
output jaccard_Imahdi: 
[[0.28571429 0.4        0.16666667]
 [0.14285714 0.16666667 0.14285714]]
output jaccard_Imahdi_Numba: 
[[0.28571429 0.4        0.16666667]
 [0.14285714 0.16666667 0.14285714]]
I'mahdi

  • Thanks @I'mahdi. This is a really optimised solution. But does it run on GPUs or is it just parallelising across CPUs? – helloworld Jun 20 '22 at 05:00
  • @helloworld, you're welcome, no problem. This solution could run on the GPU, but note that when calling the function we pass a `numpy.array`, not a **Tensor**, because `Numba` works with `numpy.array`. – I'mahdi Jun 20 '22 at 05:11