
I need to find duplicates in a 2D NumPy array. As a result I want a list of the same length as the input, where each entry points to the first occurrence of the corresponding value. For example, in the array [[1, 0, 0], [1, 0, 0], [2, 3, 4]] the rows at indices 0 and 1 are equal, so the method should return [0, 0, 2] (see examples in the code below). The following code works, but it is slow for large arrays.

import numpy as np


def duplicates(ar):
    """
    Args:
        ar (array_like): array

    Returns:
        list of int: each int points to the first occurrence of the unique value
    """
    # duplicates array:
    dup = np.full(ar.shape[0], -1, dtype=int)
    for i in range(ar.shape[0]):
        if dup[i] != -1:
            # i was already found to be a duplicate of an earlier row
            continue
        else:
            dup[i] = i
        for j in range(i + 1, ar.shape[0]):
            if (ar[i] == ar[j]).all():
                dup[j] = i
    return dup


if __name__ == '__main__':
    n = 100
    # shortest extreme for n points
    a1 = np.array([[0, 1, 2]] * n)
    assert (duplicates(a1) == np.full(n, 0)).all()

    # longest extreme for n points
    a2 = np.linspace(0, 1, n * 3).reshape((n, 3))
    assert (duplicates(a2) == np.arange(0, n)).all()

    # test case
    a3 = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
    assert (duplicates(a3) == [0, 0, 2]).all()

Any idea how to speed up the process (e.g. avoid the second for loop), or alternative implementations? Cheers

3 Answers


What you're doing requires comparing N rows, each of length M, against one another in every possible pairing. In the worst case, when there are no duplicates, that scales as O(N^2 * M).

A better method is to hash each row. If hashing a row takes O(M) time, the whole pass scales as O(N * M). You can do that with a dictionary:

def duplicates(ar):
    """
    Args:
        ar (array_like): array

    Returns:
        list of int: each int points to the first occurrence of the unique value
    """
    # map each row (as a tuple, which is hashable) to the index of its
    # first occurrence
    first_occurrence = {}
    dup = np.zeros(ar.shape[0], dtype=int)
    for i in range(ar.shape[0]):
        as_tuple = tuple(ar[i])
        if as_tuple not in first_occurrence:
            first_occurrence[as_tuple] = i
        dup[i] = first_occurrence[as_tuple]
    return dup
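
On NumPy 1.13+ the same mapping can also be obtained directly from np.unique with axis=0, since return_index uses a stable sort and therefore reports first occurrences. A minimal sketch (duplicates_unique is just an illustrative name):

def duplicates_unique(ar):
    # `first` holds, for each unique row, the index of its first
    # occurrence in `ar`; `inv` maps every row of `ar` to its unique row
    _, first, inv = np.unique(ar, axis=0,
                              return_index=True, return_inverse=True)
    # ravel guards against NumPy versions where `inv` is not returned flat
    return first[np.ravel(inv)]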
Jeremy McGibbon

Here's one vectorized approach -

def duplicates_1(a):
    # lexsort sorts rows lexicographically and is stable, so equal rows
    # keep their original relative order
    sidx = np.lexsort(a.T)
    b = a[sidx]

    # boundaries between runs of identical rows in the sorted array
    grp_idx0 = np.flatnonzero((b[1:] != b[:-1]).any(1)) + 1
    grp_idx = np.concatenate(([0], grp_idx0, [b.shape[0]]))
    # group id for every sorted position
    ids = np.repeat(range(len(grp_idx) - 1), np.diff(grp_idx))
    # map the group ids back to the original row order
    sidx_mapped = argsort_unique(sidx)
    ids_mapped = ids[sidx_mapped]

    # first original index of each group (stability guarantees it is
    # the minimum index)
    grp_minidx = sidx[grp_idx[:-1]]
    out = grp_minidx[ids_mapped]
    return out

Using the concept of an array view, which lets us work at the 1D level, here's a modification of the first approach -

def duplicates_1_view1D(a):
    # collapse each row into a single void element so the array can be
    # processed as 1D
    a1D = view1D(a)
    # a stable sort keeps equal elements in original order, so the first
    # element of each group is also the first occurrence
    sidx0 = a1D.argsort(kind='stable')
    b0 = a1D[sidx0]

    N = len(b0)
    grp_idx0 = np.concatenate(([0], np.flatnonzero(b0[1:] != b0[:-1]) + 1, [N]))
    ids0 = np.repeat(range(len(grp_idx0) - 1), np.diff(grp_idx0))
    sidx_mapped0 = argsort_unique(sidx0)
    ids_mapped0 = ids0[sidx_mapped0]

    grp_minidx0 = sidx0[grp_idx0[:-1]]
    out0 = grp_minidx0[ids_mapped0]
    return out0

Helper functions -

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a):  # a is a 2D array
    # reinterpret each row as one opaque void scalar (its raw bytes),
    # so whole rows can be compared and sorted as 1D elements
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# https://stackoverflow.com/a/43411559/ @Divakar
def argsort_unique(idx):
    # inverse permutation of idx in O(n); equivalent to np.argsort(idx)
    # when idx is a permutation of 0..n-1
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
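
As a quick sanity check: for a permutation idx, argsort_unique(idx) matches np.argsort(idx), just without the cost of a second sort -

idx = np.array([2, 0, 3, 1])  # e.g. the result of an argsort
assert (argsort_unique(idx) == np.argsort(idx)).all()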
Divakar

I timed the answers from Divakar and Jeremy for the two test cases in my code example, marked with "# shortest extreme for n points" and "# longest extreme for n points". All answers yield the expected results and speed things up enormously. Divakar's vectorized approach is the fastest throughout. Thanks. All credit goes to Divakar and Jeremy.

[Benchmark plots: minimum and maximum runtimes for each implementation.]
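
The exact timing setup isn't reproduced here; a minimal sketch of how such a comparison could be run with timeit (assuming the functions above are in scope; add the other implementations to the tuple to compare them too):

import timeit

n = 1000
a1 = np.array([[0, 1, 2]] * n)                 # shortest extreme: all rows equal
a2 = np.linspace(0, 1, n * 3).reshape((n, 3))  # longest extreme: all rows distinct
for f in (duplicates_1, duplicates_1_view1D):
    print(f.__name__,
          timeit.timeit(lambda: f(a1), number=10),
          timeit.timeit(lambda: f(a2), number=10))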

EDIT: While implementing the vectorized approach, some further testing revealed an error. For the example array

[[ 0.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  1.  0.]
 [ 0.  1.  0.]
 [ 2.  0.  0.]
 [ 3.  0.  0.]
 [ 3.  1.  0.]
 [ 2.  1.  0.]]

the vectorized method returns an all-zero list. The view1D variant is second fastest, so I'll take that one.
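
For reference, a quick check of the expected result on that array (all rows are distinct, so every index should map to itself):

a = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.],
              [2., 0., 0.], [3., 0., 0.], [3., 1., 0.], [2., 1., 0.]])
# no duplicate rows, so the correct output is [0, 1, 2, 3, 4, 5, 6, 7]
assert (duplicates_1_view1D(a) == np.arange(len(a))).all()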

EDIT2: Divakar fixed the bug. Thanks