4

I have a 2D numpy array with hundreds of thousands of rows and a thousand or so columns (let's say it's an N x P array with N = 200,000, P = 1000). The goal here is to compute the number of identical elements between each pair of row vectors, ideally using numpy array magic that won't require me to loop over the 199,999 * 100,000 such pairs in Python. Since it is probably not feasible to store a 200,000 x 200,000 array, the output would be in sparse coordinate format, with one (i, j, count) row per pair of rows, e.g. if the input is in the form:

5 12 14 200   0 45223
7 12 14   0 200 60000
7  6 23   0   0 45223
5  6 14 200   0 45223

the resulting (dense) NxN matrix M would be (without caring about diagonal elements):

0 2 2 5
2 0 2 1
2 2 0 3
5 1 3 0

such that Mij contains the number of identical elements between the initial row i and the initial row j, assuming 0-based indexing. And the expected sparse output equivalent would thus be:

0 1 2
0 2 2
0 3 5
1 2 2 
1 3 1
2 3 3

A naive, horribly inefficient way to implement this would be:

import itertools
import numpy as np

def pairwise_identical_elements(small_matrix):
    n, p = small_matrix.shape
    sparse_coordinate_matrix = []
    # Compare each pair of rows and count the positions where they agree
    # (equivalently, p minus the number of mismatching positions).
    for idx1, idx2 in itertools.combinations(range(n), 2):
        count = p - np.count_nonzero(small_matrix[idx1] != small_matrix[idx2])
        sparse_coordinate_matrix.append([idx1, idx2, count])
    return sparse_coordinate_matrix
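
For reference, running this function on the 4 x 6 example above reproduces the expected sparse output (a quick sanity check; the array name X is mine):

>>> X = np.array([[5, 12, 14, 200,   0, 45223],
...               [7, 12, 14,   0, 200, 60000],
...               [7,  6, 23,   0,   0, 45223],
...               [5,  6, 14, 200,   0, 45223]])
>>> pairwise_identical_elements(X)
[[0, 1, 2], [0, 2, 2], [0, 3, 5], [1, 2, 2], [1, 3, 1], [2, 3, 3]]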

I have looked into distance metric implementations such as the Jaccard similarity in scipy and sklearn, but they all assume the input row vectors are binary. I have also tried adding a third dimension to make the entries binary (e.g. an entry '9' becomes a vector of zeroes with a 1 in the 9th position), but there are obvious memory issues: an entry '45223' would require the third dimension to stretch by that many elements.
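
To make that memory problem concrete, here is a minimal sketch of the "third dimension" idea on the small example (the names X, one_hot and depth are my own, not from any library):

import numpy as np

# Minimal sketch of the one-hot / "third dimension" idea described above,
# only to make the memory problem concrete. Not viable at the stated scale.
X = np.array([[5, 12, 14, 200,   0, 45223],
              [7, 12, 14,   0, 200, 60000],
              [7,  6, 23,   0,   0, 45223],
              [5,  6, 14, 200,   0, 45223]])

n, p = X.shape
depth = X.max() + 1    # 45224 here: the third dimension stretches this far

# one_hot[i, k, v] == 1 iff X[i, k] == v
one_hot = np.zeros((n, p, depth), dtype=np.int64)
one_hot[np.arange(n)[:, None], np.arange(p), X] = 1

# Pairwise identical-element counts as a single contraction over (k, v);
# the diagonal is simply p.
M = np.einsum('ikv,jkv->ij', one_hot, one_hot)
print(M)

# Even at one byte per entry, one_hot would need N * P * depth bytes,
# roughly 9 TB for N = 200,000, P = 1000 and entries as large as 45223.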

Is there an efficient, scalable and/or pythonic solution using numpy or scipy in a way that I have missed?

Edit: after looking further into scipy I found something that closely matches what I'm trying to do, namely scipy.spatial.distance.pdist with the Hamming metric. However, it returns its output in 'condensed' form, and since we are trying to avoid converting to a full dense array to save memory, the question becomes: how do I convert a condensed distance matrix into a sparse one?
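
For concreteness, a minimal sketch of what that call returns on the small example above (the array name X and the rounding step are mine):

import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[5, 12, 14, 200,   0, 45223],
              [7, 12, 14,   0, 200, 60000],
              [7,  6, 23,   0,   0, 45223],
              [5,  6, 14, 200,   0, 45223]])
n, p = X.shape

# Condensed vector of length n*(n-1)/2, ordered (0,1), (0,2), (0,3), (1,2), ...
d = pdist(X, metric='hamming')
# Each entry is the fraction of mismatching positions, so:
counts = np.rint(p * (1 - d)).astype(int)
print(counts)    # [2 2 5 2 1 3]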

lurena
  • Please include some code to show us more of what you intend to do – Abr001am Sep 21 '17 at 12:35
  • Sure, I added a small snippet that implements what I'm trying to do in a naive way. The input is assumed to be a standard numpy array. – lurena Sep 21 '17 at 13:26
  • So what is at stake here? Space complexity, or time? – Abr001am Sep 21 '17 at 14:05
  • Memory is an obvious issue (since Python tends to just crash when processing too many objects), but any run time in the same order of magnitude as the aforementioned pdist function in scipy.spatial.distance is fine. – lurena Sep 21 '17 at 14:16
  • @lurena could you explain the example in the question, because running your function on the example input gives a different output – Saullo G. P. Castro Sep 22 '17 at 16:33
  • Oops. The function assumes 0-based indexing, my example used 1-based indexing. Hope it's clearer now. – lurena Sep 25 '17 at 09:23

1 Answer

0

As said in the comments, scipy's pdist with 'hamming' is the simplest efficient way to solve this, both in terms of memory and CPU time.

You won't be able to be more memory-efficient than its condensed output: writing the result into your "sparse" format requires an (N*(N-1)/2, 3) matrix, compared to the length N*(N-1)/2 vector returned by pdist.
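
To address the follow-up in the question ("how to convert a condensed distance matrix into a sparse one?"), one possibility, sketched here on the small example rather than given as the definitive implementation, is to pair the condensed vector with np.triu_indices, which enumerates the upper-triangle pairs (0,1), (0,2), ..., (n-2,n-1) in exactly the order pdist uses:

import numpy as np
from scipy.spatial.distance import pdist

# Attach (i, j) coordinates to pdist's condensed output and recover the
# identical-element counts from the Hamming mismatch fractions.
X = np.array([[5, 12, 14, 200,   0, 45223],
              [7, 12, 14,   0, 200, 60000],
              [7,  6, 23,   0,   0, 45223],
              [5,  6, 14, 200,   0, 45223]])
n, p = X.shape

d = pdist(X, metric='hamming')                  # condensed mismatch fractions
counts = np.rint(p * (1 - d)).astype(np.int64)  # identical-element counts
i, j = np.triu_indices(n, k=1)                  # k=1 skips the diagonal
sparse_coordinates = np.column_stack((i, j, counts))
print(sparse_coordinates)
# [[0 1 2]
#  [0 2 2]
#  [0 3 5]
#  [1 2 2]
#  [1 3 1]
#  [2 3 3]]

If many pairs share no elements, the zero-count rows can be filtered out before stacking, or the triples can be handed to scipy.sparse.coo_matrix((counts, (i, j)), shape=(n, n)) to obtain an actual sparse matrix object.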

M1L0U