How to get a list of all indices of repeated elements in a numpy array

Question

I'm trying to get the index of all repeated elements in a numpy array, but the solution I found for the moment is REALLY inefficient for a large (>20000 elements) input array (it takes more or less 9 seconds). The idea is simple:

records_arrayis a numpy array of timestamps (datetime) from which we want to extract the indexes of repeated timestamps
time_array is a numpy array containing all the timestamps that are repeated in records_array
records is a django QuerySet (which can easily converted to a list) containing some Record objects. We want to create a list of couples formed by all possible combinations of tagId attributes of Record corresponding to the repeated timestamps found from records_array.

Here is the working (but inefficient) code I have for the moment:

tag_couples = [];
for t in time_array:
    users_inter = np.nonzero(records_array == t)[0] # Get all repeated timestamps in records_array for time t
    l = [str(records[i].tagId) for i in users_inter] # Create a temporary list containing all tagIds recorded at time t
    if l.count(l[0]) != len(l): #remove tuples formed by the first tag repeated
        tag_couples +=[x for x in itertools.combinations(list(set(l)),2)] # Remove duplicates with list(set(l)) and append all possible couple combinations to tag_couples

I'm quite sure this can be optimized by using Numpy, but I can't find a way to compare records_array with each element of time_array without using a for loop (this can't be compared by just using ==, since they are both arrays).

score 46 · Accepted Answer · edited Sep 26 '20 at 18:44

A vectorized solution with numpy, on the magic of unique().

import numpy as np

# create a test array
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# creates an array of indices, sorted by unique element
idx_sort = np.argsort(records_array)

# sorts records array so all unique elements are together 
sorted_records_array = records_array[idx_sort]

# returns the unique values, the index of the first occurrence of a value, and the count for each element
vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)

# splits the indices into separate arrays
res = np.split(idx_sort, idx_start[1:])

#filter them with respect to their size, keeping only items occurring more than once
vals = vals[count > 1]
res = filter(lambda x: x.size > 1, res)

The following code was the original answer, which required a bit more memory, using numpy broadcasting and calling unique twice:

records_array = array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, inverse, count = unique(records_array, return_inverse=True,
                              return_counts=True)

idx_vals_repeated = where(count > 1)[0]
vals_repeated = vals[idx_vals_repeated]

rows, cols = where(inverse == idx_vals_repeated[:, newaxis])
_, inverse_rows = unique(rows, return_index=True)
res = split(cols, inverse_rows[1:])

with as expected res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]

As a warning for this solution, `idx_sort` is non-deterministic for arrays with repeats since the default sort in np.sort is unstable. To fix this, make sure to use the merge sort kind: `np.argsort(a, kind='mergesort')`. — user27443, May 31 '21 at 23:27

Trenton McKinney · Answer 2 · 2021-09-04T15:33:54.563

The answer is complicated, and dependent upon the size, and number of unique elements in the array.
The following:
- Tests arrays with 2M elements, and up to 20k unique elements.
- Tests arrays up to 80k elements, with a max of 20k unique elements
  - For arrays under 40k elements, the tests have up to half the unique elements as the size of the array (e.g. 10k elements would have up to 5k unique elements).

Arrays with 2M Elements

np.where is faster than defaultdict for up to about 200 unique elements, but slower than pandas.core.groupby.GroupBy.indices, and np.unique.
The solution using pandas, is the fastest solution for large arrays.

Arrays with up to 80k Elements

This is more situational, depending on the size of the array and the number of unique elements.
defaultdict is a fast option for arrays to about 2400 elements, especially with a large number of unique elements.
For arrays larger than 40k elements, and 20k unique elements, pandas is the fastest option.

`%timeit`

import random
import numpy
import pandas as pd
from collections import defaultdict

def dd(l):
    # default_dict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices


def npw(l):
    # np_where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}


def uni(records_array):
    # np_unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))


def daf(l):
    # pandas test
    return pd.DataFrame(l).groupby([0]).indices


data = defaultdict(list)

for x in range(4, 20000, 100):  # number of unique elements
    # create 2M element list
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])
    
    res1 = %timeit -r2 -n1 -q -o dd(a)
    res2 = %timeit -r2 -n1 -q -o npw(a)
    res3 = %timeit -r2 -n1 -q -o uni(a)
    res4 = %timeit -r2 -n1 -q -o daf(a)
    
    data['defaut_dict'].append(res1.average)
    data['np_where'].append(res2.average)
    data['np_unique'].append(res3.average)
    data['pandas'].append(res4.average)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)

df.plot(figsize=(12, 5), xlabel='unique samples', ylabel='average time (s)', title='%timeit test: 2 run 1 loop each')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()

Tests with 2M elements

Tests with up to 80k elements

Ashish · Answer 3 · 2015-05-02T20:32:11.587

4

You can also do this:

a = [1,2,3,1,1,3,4,3,2]
index_sets = [np.argwhere(i==a) for i in np.unique(a)]

this will give you set of arrays with indices of unique elements.

[array([[0],[3],[4]], dtype=int64), 
array([[1],[8]], dtype=int64), 
array([[2],[5],[7]], dtype=int64), 
array([[6]], dtype=int64)]

Added: Further change in list comprehension can also discard single unique values and address the speed concern in case of many unique single occurring elements:

new_index_sets = [np.argwhere(i[0]== a) for i in np.array(np.unique(a, return_counts=True)).T if i[1]>=2]

this gives:

[array([[0],[3],[4]], dtype=int64), 
 array([[1],[8]], dtype=int64), 
 array([[2],[5],[7]], dtype=int64)]

edited May 02 '15 at 20:32

answered May 02 '15 at 14:50

Ashish

729
1
9
23

This will be very slow if the array contains many unique values. – gg349 May 02 '15 at 15:02
@gg349 Thanks for pointing that. A minor edit solved that purpose too. Now I hope it's fast enough. – Ashish May 02 '15 at 19:50
1

The code is now slow because it calls `where()` as many times as as many repeated values there are. Each call to `where` must traverse the whole array. – gg349 May 03 '15 at 08:19
I don't know how to further improve it. But, just tried `%timeit` for our solutions with a 20000 elements array and found that my final solution takes 2.23 ms, your 1st solution takes 2.83 ms and your 2nd solution takes 9.84 ms, per loop. – Ashish May 03 '15 at 08:44
On my machine, with input `random_integers(-1000,1000,20000)`, I get roughly a 16x speedup with my first solution above. I guess you sticked with a random array with very few different values? – gg349 May 03 '15 at 08:59
Indeed. I just used your `record_array` and multipied it 2000 time i.e. `records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2, 1]*2000)` which cased that. With np.random.randint(-1000,1000,20000) your solution is ~16x faster. – Ashish May 03 '15 at 09:09
on my machine I find `np.where(a[1:,0] != a[:-1,0])` fastest still – paddyg May 03 '15 at 19:41

user27443 · Answer 4 · 2021-05-31T23:00:40.010

I've found that not using np.unique, and instead using np.diff is significantly faster and handles non-sorted initial arrays much better.

To show this, I ran @Trenton McKinney's benchmark for a couple of the trial numbers (2 million and 20k) to show that the diff solution floors the others. It also does not require a sorted array or sorting the array, which is a significant benefit.

Here is the function:

def find_repeats(arr: np.ndarray) -> np.ndarray:
    """Find indices of repeat values in an array.

    Args:
        arr (np.ndarray): An array to find repeat values in.

    Returns:
        np.ndarray: An array of indices into arr which are the values which
            repeat.
    """

    arr_diff = np.diff(arr, append=[arr[-1] + 1])
    res_mask = arr_diff == 0
    arr_diff_zero_right = np.nonzero(res_mask)[0] + 1
    res_mask[arr_diff_zero_right] = True
    return np.nonzero(res_mask)[0]

2 Million Elements

20k Elements

Full Test Code

import random
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from collections import defaultdict
import time


def find_repeats(arr: np.ndarray) -> np.ndarray:
    """Find indices of repeat values in an array.

    Args:
        arr (np.ndarray): An array to find repeat values in.

    Returns:
        np.ndarray: An array of indices into arr which are the values which
            repeat.
    """

    arr_diff = np.diff(arr, append=[arr[-1] + 1])
    res_mask = arr_diff == 0
    arr_diff_zero_right = np.nonzero(res_mask)[0] + 1
    res_mask[arr_diff_zero_right] = True
    return np.nonzero(res_mask)[0]


def dd(l):
    # default_dict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices


def npw(l):
    # np_where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}


def uni(records_array):
    # np_unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(
        sorted_records_array, return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))


def daf(l):
    # pandas test
    return pd.DataFrame(l).groupby([0]).indices


data = defaultdict(list)

for x in range(4, 20000, 1000):  # number of unique elements
    print(f"{x} trial done")
    # create 2M element list
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])
    num_runs = 2
    t0 = time.time()
    for i in range(num_runs):
        dd(a)
    res1 = time.time() - t0

    t0 = time.time()
    for i in range(num_runs):
        uni(a)
    res3 = time.time() - t0

    t0 = time.time()
    for i in range(num_runs):
        daf(a)
    res4 = time.time() - t0

    t0 = time.time()
    for i in range(num_runs):
        find_repeats(a)
    res5 = time.time() - t0

    data['defaut_dict'].append(res1 / num_runs)
    data['np_unique'].append(res3 / num_runs)
    data['pandas'].append(res4 / num_runs)
    data['np_diff'].append(res5 / num_runs)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)

df.plot(figsize=(12, 5), xlabel='unique samples',
        ylabel='average time (s)', title='%timeit test: 2 run 1 loop each')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()

paddyg · Answer 5 · 2016-02-29T11:26:20.200

You could do something along the lines of:

1. add original index ref so [[1,0],[2,1],[3,2],[1,3],[1,4]...
2. sort on [:,0]
3. use np.where(ra[1:,0] != ra[:-1,0])
4. use the list of indexes from above to construct your final list of lists

EDIT - OK so after my quick reply I've been away for a while and I see I've been voted down which is fair enough as numpy.argsort() is a much better way than my suggestion. I did vote up the numpy.unique() answer as this is an interesting feature. However if you use timeit you will find that

idx_start = np.where(sorted_records_array[:-1] != sorted_records_array[1:])[0] + 1
res = np.split(idx_sort, idx_start)

is marginally faster than

vals, idx_start = np.unique(sorted_records_array, return_index=True)
res = np.split(idx_sort, idx_start[1:])

Further edit follow question by @Nicolas

I'm not sure you can. It would be possible to get two arrays of indices in corresponding with the break points but you can't break different 'lines' of the array up into different sized pieces using np.split so

a = np.array([[4,27,42,12, 4 .. 240, 12], [3,65,23...] etc])
idx = np.argsort(a, axis=1)
sorted_a = np.diagonal(a[:, idx[:]]).T
idx_start = np.where(sorted_a[:,:-1] != sorted_a[:,1:])

# idx_start => (array([0,0,0,..1,1,..]), array([1,4,6,7..99,0,4,5]))

but that might be good enough depending on what you want to do with the information.

Could this be made to work somehow if the original array has multiple rows and iteration through each row without a for loop? — Nickpick, Feb 24 '16 at 20:04
You won't be able to do the splitting. I will add an edit above — paddyg, Feb 24 '16 at 22:25

Matthew · Answer 6 · 2015-05-02T14:48:33.603

so I was unable to get rid of the for loop, but I was able to pair it down to using the for loop marginally using the set data type and the list.count() method:

data = [1,2,3,1,4,5,2,2]
indivs = set(data)

multi_index = lambda lst, val: [i for i, x in enumerate(lst) if x == val]

if data != list(indivs):
    dupes = [multi_index(data, i) for i in indivs if data.count(i) > 1]

Where you loop over your indivs set, which contains the values (no duplicates) and then loop over the full list if you find an item with a duplicate. Am looking into numpy alternative if this isn't fast enough for you. Generator objects might also speed this up if need be.

Edit: gg349's answer holds the numpy solution I was working on!

Mateen Ulhaq · Answer 7 · 2021-08-21T23:57:06.983

`np.unique` for all indices

@gg349's solution packaged up into a function:

def np_unique_indices(arr, **kwargs):
    """Unique indices for N-D arrays."""
    vals, indices, *others = np_unique_indices_1d(arr.reshape(-1), **kwargs)
    indices = [np.stack(np.unravel_index(x, arr.shape)) for x in indices]
    return vals, indices, *others


def np_unique_indices_1d(arr, **kwargs):
    """Unique indices for 1D arrays."""
    sort_indices = np.argsort(arr)
    arr = np.asarray(arr)[sort_indices]
    vals, first_indices, *others = np.unique(
        arr, return_index=True, **kwargs
    )
    indices = np.split(sort_indices, first_indices[1:])
    for x in indices:
        x.sort()
    return vals, indices, *others

It is essentially the same as np.unique but returns all indices, not just the first indices.

Example usage:

arr = np.array([
    [0, 1, 1, 0],
    [0, 2, 2, 0],
    [0, 2, 2, 0],
    [0, 1, 1, 0],
])

vals, indices = np_unique_indices(arr)

for val, idx in zip(vals, indices):
    print(f"{val}:\n{idx}\n")

Output:

0:
[[0 0 1 1 2 2 3 3]
 [0 3 0 3 0 3 0 3]]

1:
[[0 0 3 3]
 [1 2 1 2]]

2:
[[1 1 2 2]
 [1 2 1 2]]

Leonardo Queiroz · Answer 8 · 2021-07-27T08:15:44.553

1

import numpy as np
from numpy.lib import recfunctions as rfn

ndtype = [('records_array', int)] # Check the data type
records_array = np.ma.array([1, 2, 1, 3, 2, 3, 3, 4, 5]).view(ndtype) # Structured array
idxs = list(rfn.find_duplicates(records_array, key=None, ignoremask=True, return_index=True)[1]) # List of indices of repeated elements

edited Jul 27 '21 at 08:15

answered Jul 27 '21 at 08:06

Leonardo Queiroz

7
2

Mateen Ulhaq · Answer 9 · 2021-08-22T01:07:12.837

`numba.jit`

Another solution, but using numba.jit:

def np_unique_indices(arr, **kwargs):
    """Unique indices for N-D arrays."""
    vals, indices = np_unique_indices_1d(arr.reshape(-1))
    vals = np.asarray(vals)
    indices = [np.stack(np.unravel_index(x, arr.shape)) for x in indices]
    return vals, indices


@numba.njit
def np_unique_indices_1d(arr):
    """Unique indices for 1D arrays."""
    idxs = [[0 for _ in range(0)] for _ in range(0)]
    ptr = {}
    ptr_count = 0

    for i, x in enumerate(arr):
        if (x in ptr) == False:
            idxs.append([0 for _ in range(0)])
            ptr[x] = ptr_count
            ptr_count += 1
        idxs[ptr[x]].append(i)

    vals = [x for x in ptr]
    idxs = [np.array(x) for x in idxs]
    return vals, idxs

Using @Trenton McKinney's and user27443's benchmark:

Note that the performance of all the solutions depends on the size of the arrays and the amount of unique labels, so I recommend testing them yourself for your own data.

How to get a list of all indices of repeated elements in a numpy array

9 Answers9

Arrays with 2M Elements

Arrays with up to 80k Elements

`%timeit`

Tests with 2M elements

Tests with up to 80k elements

2 Million Elements

20k Elements

Full Test Code

`np.unique` for all indices

`numba.jit`

Linked

Related

How to get a list of all indices of repeated elements in a numpy array

9 Answers9

Arrays with 2M Elements

Arrays with up to 80k Elements

%timeit

Tests with 2M elements

Tests with up to 80k elements

2 Million Elements

20k Elements

Full Test Code

np.unique for all indices

numba.jit

Linked

Related

`%timeit`

`np.unique` for all indices

`numba.jit`