
I need to find duplicates in a 2D NumPy array. As a result I want a list of the same length as the input, where each entry points to the first occurrence of the corresponding value. For example, in the array [[1, 0, 0], [1, 0, 0], [2, 3, 4]] the rows at indices 0 and 1 are equal, so the method should return [0, 0, 2] (see examples in the code below). The following code works, but it is slow for large arrays.

import numpy as np


def duplicates(ar):
    """
    Args:
        ar (array_like): array

    Returns:
        list of int: each int points to the first occurrence of the unique value
    """
    # duplicates array:
    dup = np.full(ar.shape[0], -1, dtype=int)
    for i in range(ar.shape[0]):
        if dup[i] != -1:
            # i was already found to be a duplicate of an earlier row
            continue
        else:
            dup[i] = i
        for j in range(i + 1, ar.shape[0]):
            if (ar[i] == ar[j]).all():
                dup[j] = i
    return dup


if __name__ == '__main__':
    n = 100
    # shortest extreme for n points
    a1 = np.array([[0, 1, 2]] * n)
    assert (duplicates(a1) == np.full(n, 0)).all()

    # longest extreme for n points
    a2 = np.linspace(0, 1, n * 3).reshape((n, 3))
    assert (duplicates(a2) == np.arange(0, n)).all()

    # test case
    a3 = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
    assert (duplicates(a3) == [0, 0, 2]).all()

Any idea how to speed up the process (e.g. avoid the second for loop), or alternative implementations? Cheers

3 Answers


What you're doing requires comparing N rows, each of length M, against one another in every possible pairing. In the worst case, when there are no duplicates, that scales as O(N^2 * M).

A better method is to hash each row. If hashing a row takes O(M) time, the whole pass scales as O(N * M). You can do that with a dictionary:

def duplicates(ar):
    """
    Args:
        ar (array_like): array

    Returns:
        list of int: each int points to the first occurrence of the unique value
    """
    # map each row (as a tuple, which is hashable) to the index of its
    # first occurrence
    first_occurrence = {}
    dup = np.zeros(ar.shape[0], dtype=int)
    for i in range(ar.shape[0]):
        as_tuple = tuple(ar[i])
        if as_tuple not in first_occurrence:
            first_occurrence[as_tuple] = i
        dup[i] = first_occurrence[as_tuple]
    return dup
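
On NumPy 1.13+ the same mapping can also be obtained directly from np.unique with axis=0, since return_index uses a stable sort and therefore reports first occurrences. A minimal sketch (duplicates_unique is just an illustrative name):

def duplicates_unique(ar):
    # `first` holds, for each unique row, the index of its first
    # occurrence in `ar`; `inv` maps every row of `ar` to its unique row
    _, first, inv = np.unique(ar, axis=0,
                              return_index=True, return_inverse=True)
    # ravel guards against NumPy versions where `inv` is not returned flat
    return first[np.ravel(inv)]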
Jeremy McGibbon

Here's one vectorized approach -

def duplicates_1(a):
    # lexsort sorts rows lexicographically and is stable, so equal rows
    # keep their original relative order
    sidx = np.lexsort(a.T)
    b = a[sidx]

    # boundaries between runs of identical rows in the sorted array
    grp_idx0 = np.flatnonzero((b[1:] != b[:-1]).any(1)) + 1
    grp_idx = np.concatenate(([0], grp_idx0, [b.shape[0]]))
    # group id for every sorted position
    ids = np.repeat(range(len(grp_idx) - 1), np.diff(grp_idx))
    # map the group ids back to the original row order
    sidx_mapped = argsort_unique(sidx)
    ids_mapped = ids[sidx_mapped]

    # first original index of each group (stability guarantees it is
    # the minimum index)
    grp_minidx = sidx[grp_idx[:-1]]
    out = grp_minidx[ids_mapped]
    return out

Using the concept of an array view, which lets us work at the 1D level, here's a modification of the first approach -

def duplicates_1_view1D(a):
    # collapse each row into a single void element so the array can be
    # processed as 1D
    a1D = view1D(a)
    # a stable sort keeps equal elements in original order, so the first
    # element of each group is also the first occurrence
    sidx0 = a1D.argsort(kind='stable')
    b0 = a1D[sidx0]

    N = len(b0)
    grp_idx0 = np.concatenate(([0], np.flatnonzero(b0[1:] != b0[:-1]) + 1, [N]))
    ids0 = np.repeat(range(len(grp_idx0) - 1), np.diff(grp_idx0))
    sidx_mapped0 = argsort_unique(sidx0)
    ids_mapped0 = ids0[sidx_mapped0]

    grp_minidx0 = sidx0[grp_idx0[:-1]]
    out0 = grp_minidx0[ids_mapped0]
    return out0

Helper functions -

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a):  # a is a 2D array
    # reinterpret each row as one opaque void scalar (its raw bytes),
    # so whole rows can be compared and sorted as 1D elements
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# https://stackoverflow.com/a/43411559/ @Divakar
def argsort_unique(idx):
    # inverse permutation of idx in O(n); equivalent to np.argsort(idx)
    # when idx is a permutation of 0..n-1
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
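
As a quick sanity check: for a permutation idx, argsort_unique(idx) matches np.argsort(idx), just without the cost of a second sort -

idx = np.array([2, 0, 3, 1])  # e.g. the result of an argsort
assert (argsort_unique(idx) == np.argsort(idx)).all()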
Divakar

I timed the answers from Divakar and Jeremy for the two test cases in my code example, marked with "# shortest extreme for n points" and "# longest extreme for n points". All answers yield the expected results and speed things up enormously. Divakar's vectorized approach is the fastest throughout. Thanks. All credit goes to Divakar and Jeremy.

[Benchmark plots: minimum and maximum runtimes for each implementation.]
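
The exact timing setup isn't reproduced here; a minimal sketch of how such a comparison could be run with timeit (assuming the functions above are in scope; add the other implementations to the tuple to compare them too):

import timeit

n = 1000
a1 = np.array([[0, 1, 2]] * n)                 # shortest extreme: all rows equal
a2 = np.linspace(0, 1, n * 3).reshape((n, 3))  # longest extreme: all rows distinct
for f in (duplicates_1, duplicates_1_view1D):
    print(f.__name__,
          timeit.timeit(lambda: f(a1), number=10),
          timeit.timeit(lambda: f(a2), number=10))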

EDIT: While implementing the vectorized approach, some further testing revealed an error. For the example array

[[ 0.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  1.  0.]
 [ 0.  1.  0.]
 [ 2.  0.  0.]
 [ 3.  0.  0.]
 [ 3.  1.  0.]
 [ 2.  1.  0.]]

the vectorized method returns an all-zero list. The view1D variant is second fastest, so I'll take that one.
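
For reference, a quick check of the expected result on that array (all rows are distinct, so every index should map to itself):

a = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.],
              [2., 0., 0.], [3., 0., 0.], [3., 1., 0.], [2., 1., 0.]])
# no duplicate rows, so the correct output is [0, 1, 2, 3, 4, 5, 6, 7]
assert (duplicates_1_view1D(a) == np.arange(len(a))).all()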

EDIT2: Divakar fixed the bug. Thanks