I have a 2D numpy array with hundreds of thousands of rows and a thousand or so columns (say an N x P array with N = 200,000 and P = 1,000). The goal is to compute the number of identical elements between each pair of row vectors, ideally using numpy array operations rather than a Python loop over all N(N-1)/2 = 199,999 * 100,000 (about 2 * 10^10) pairs. Since it is probably not feasible to store a 200,000 x 200,000 array, the output would instead be in sparse coordinate format, one (i, j, count) triplet per row. For example, if the input is:
5 12 14 200 0 45223
7 12 14 0 200 60000
7 6 23 0 0 45223
5 6 14 200 0 45223
the resulting (dense) NxN matrix M would be (ignoring the diagonal elements):
0 2 2 5
2 0 2 1
2 2 0 3
5 1 3 0
such that Mij contains the number of identical elements between input rows i and j, assuming 0-based indexing. The expected sparse output equivalent would thus be:
0 1 2
0 2 2
0 3 5
1 2 2
1 3 1
2 3 3
A naive, horribly inefficient way to implement this would be:
import itertools
import numpy as np

def pairwise_identical_elements(small_matrix):
    n, p = small_matrix.shape
    coordinates = itertools.combinations(range(n), 2)
    sparse_coordinate_matrix = []
    # Quadratic loop over every pair of rows -- exactly what we want to avoid.
    for row1, row2 in itertools.combinations(small_matrix, 2):
        idx1, idx2 = next(coordinates)
        # Matching positions = total columns minus differing positions.
        count = p - np.count_nonzero(row1 - row2)
        sparse_coordinate_matrix.append([idx1, idx2, count])
    return sparse_coordinate_matrix
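For reference, running this (with the imports above) on the 4 x 6 example reproduces the expected triplets:

small = np.array([[5, 12, 14, 200, 0, 45223],
                  [7, 12, 14, 0, 200, 60000],
                  [7, 6, 23, 0, 0, 45223],
                  [5, 6, 14, 200, 0, 45223]])
print(pairwise_identical_elements(small))
# [[0, 1, 2], [0, 2, 2], [0, 3, 5], [1, 2, 2], [1, 3, 1], [2, 3, 3]]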
I have looked into distance metric implementations such as the Jaccard similarity in scipy and sklearn, but they all assume the input row vectors are binary. I have also tried adding a third dimension to make the entries binary (e.g. an entry '9' becoming a vector of zeroes with a 1 in the 9th position), but there are obvious memory issues (an entry '45223' would require the third dimension to stretch by that many elements).
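That said, the one-hot idea might be salvageable with a sparse incidence matrix instead of a dense third dimension. Here is a minimal, untested sketch (the function name is mine; it assumes nonnegative integer entries, and b @ b.T can still grow huge whenever many rows share common values, e.g. lots of zeros):

import numpy as np
from scipy import sparse

def pairwise_identical_sparse(a):
    n, p = a.shape
    # Tag each entry with its column index so that equal values only
    # count as a match when they occur in the same column.
    # (Assumes nonnegative integer entries.)
    tagged = a.astype(np.int64) * p + np.arange(p)
    # Compact the (column, value) tags to consecutive codes 0..k-1.
    _, codes = np.unique(tagged.ravel(), return_inverse=True)
    # Sparse 0/1 incidence matrix: one row per input row, one column
    # per distinct (column, value) pair -- the sparse analogue of the
    # dense third dimension described above.
    b = sparse.csr_matrix((np.ones(n * p, dtype=np.int32),
                           (np.repeat(np.arange(n), p), codes)))
    # b @ b.T counts, for each pair of rows, how many (column, value)
    # pairs they share, i.e. the number of identical elements.
    m = sparse.triu(b @ b.T, k=1).tocoo()
    return np.column_stack([m.row, m.col, m.data])

On the toy example above this reproduces the six triplets, but I have no idea whether it stays tractable at N = 200,000.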
Is there an efficient, scalable and/or pythonic solution using numpy or scipy that I have missed?
Edit: after looking further into scipy I found something that closely matches what I'm trying to do, namely scipy.spatial.distance.pdist with the Hamming metric. However, it returns the output in 'condensed' form, and since we are trying to avoid a full dense array to save memory, the question becomes: how do you convert a condensed distance matrix into a sparse one?
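If I'm reading the docs right, the condensed vector is laid out in row-major upper-triangle order, so the conversion itself is mechanical: np.triu_indices enumerates the (i, j) pairs in the same order, and the Hamming fractions scale back to match counts via count = round((1 - d) * P). A sketch (which of course still materialises all N(N-1)/2 entries, so it avoids the dense NxN matrix but not the quadratic memory):

import numpy as np
from scipy.spatial.distance import pdist

def condensed_to_sparse_counts(a):
    n, p = a.shape
    # For each pair of rows, pdist's Hamming metric returns the
    # *fraction* of positions that disagree, in condensed order.
    d = pdist(a, metric='hamming')
    # Fraction of agreeing positions times p = number of identical
    # elements; np.rint guards against floating-point jitter.
    counts = np.rint((1.0 - d) * p).astype(np.int64)
    # np.triu_indices yields (i, j) pairs in the same row-major
    # upper-triangle order that pdist uses.
    i, j = np.triu_indices(n, k=1)
    return np.column_stack([i, j, counts])

Dropping the triplets whose count is zero would then give a genuinely sparse result, though at N = 200,000 the condensed vector alone is ~2 * 10^10 entries, which remains the real bottleneck.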