11

I have many large (>100,000,000 elements) lists of integers that contain many duplicates. I want to get the indices at which each element occurs. Currently I am doing something like this:

import numpy as np
from collections import defaultdict

a = np.array([1, 2, 6, 4, 2, 3, 2])
d = defaultdict(list)
for i, e in enumerate(a):
    d[e].append(i)

d
defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})

Iterating through each element like this is time consuming. Is there an efficient or vectorized way to do this?

Edit 1: I tried the methods of Acorbe and Jaime on the following:

a = np.random.randint(2000, size=10000000)

The results are

original: 5.01767015457 secs
Acorbe: 6.11163902283 secs
Jaime: 3.79637312889 secs
– imsc

6 Answers

10

This is very similar to what was asked here, so what follows is an adaptation of my answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality (see here):

>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[sort_idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])

and now:

>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1], dtype=int64)

To get the positional indices for each value, we simply do:

>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64),
 array([3], dtype=int64), array([2], dtype=int64)]

And you can now construct your dictionary by zipping unq_items and unq_idx.
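For example, rebuilding the mapping from the question (reusing the variables above; the dtype annotation may differ by platform):

>>> d = dict(zip(unq_items, unq_idx))
>>> d[2]
array([1, 4, 6], dtype=int64)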

Note that unq_count doesn't count the occurrences of the last unique item, because that is not needed to split the index array. If you wanted to have all the values you could do:

>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
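With the example array from above, this full-count variant yields the same five index groups plus the count of the last unique item (dtype annotations omitted):

>>> unq_count
array([1, 3, 1, 1, 1])
>>> unq_idx
[array([0]), array([1, 4, 6]), array([5]), array([3]), array([2])]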
– Jaime
3
def to_components(index):
    return np.split(np.argsort(index), np.cumsum(np.unique(index, return_counts=True)[1]))
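A quick usage sketch with the question's example array; note that the last cumulative count equals len(index), so np.split also returns a trailing empty array:

import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])
groups = to_components(a)
# groups: one index array per unique value (in sorted value order),
# plus a trailing empty array from the final split point at len(a)
d = dict(zip(np.unique(a), groups))  # zip stops at the shorter input, dropping the empty tail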
– Dmitry Kuzminov
2

This can be solved via pandas (the Python data analysis library) and a DataFrame.groupby call.

Consider the following

 a = np.array([1, 2, 6, 4, 2, 3, 2])

 import pandas as pd
 df = pd.DataFrame({'a':a})

 gg = df.groupby(by=df.a)
 gg.groups

output

 {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]}
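Note that, depending on the pandas version, the values in gg.groups may be Index objects rather than plain lists. The groupby object also exposes an indices attribute that maps each value to a NumPy array of positional indices, which is closer to what the question asks for (a small sketch reusing gg from above):

 gg.indices
 # e.g. {1: array([0]), 2: array([1, 4, 6]), 3: array([5]), 4: array([3]), 6: array([2])}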
– Acorbe
  • I have never used pandas. Is this faster than the pure Python version? – imsc Apr 24 '14 at 12:45
  • @imsc, AFAIK it is based on numpy for what concerns data types and implements Cython and pure C methods for speed. I regularly and happily use it for large datasets (~10M records). – Acorbe Apr 24 '14 at 12:47
  • Thanks, I will test it and let you know. – imsc Apr 24 '14 at 12:48
  • With 10 million random integers in the range `[0,20]`, on my machine at least, the pandas method takes 5.27 s vs 5.86 s for the OP's method if `a` is a np.ndarray. If `a` is a standard Python list, then pandas takes 7.85 s, whereas the original method takes 4.05 s. – JoshAdel Apr 24 '14 at 14:07
  • @JoshAdel, that's due to the fact that an intermediate conversion to np.ndarray is performed by pandas, I guess. – Acorbe Apr 24 '14 at 14:12
  • FYI: for a 10k-item `np.ndarray` of 1k different integers, @Jaime's pure `numpy` method is fastest; @Denis Shcheglov's way is 20% slower (probably because the sorting is done twice: `argsort` + `unique`). `pandas` is 20x slower. Same seed for the random generator, tested with `timeit` (numpy 1.16.2, pandas 0.24.2). – Gunther Struyf May 08 '19 at 13:46
2

The numpy_indexed package (disclaimer: I am its author) implements a solution inspired by Jaime's, but with tests, a nice interface, and a lot of related functionality:

import numpy_indexed as npi
unique, idx_groups = npi.group_by(a, np.arange(len(a)))
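Since group_by returns the unique values together with the corresponding groups of the second argument, the dictionary from the question can be rebuilt with a plain zip:

d = dict(zip(unique, idx_groups))  # maps each unique value to the array of positions where it occurs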
– Eelco Hoogendoorn
1

Simple and quick solution.

a = np.array([0, 0, 0, 1, 1, 3, 3, 3, 2, 2, 2, 0, 0, 1, 4])
sort_idx = np.argsort(a)
unique, counts = np.unique(a, return_counts=True)
# slice the sorted index array by cumulative counts, so it also works for non-contiguous values
starts = np.cumsum(counts) - counts
b = {key: sort_idx[start:start + count] for key, start, count in zip(unique, starts, counts)}
– Denis Shcheglov
1

I know this is an old question, but I was recently working on a similar problem where performance is critical, so I experimented extensively with timings. I hope my findings will be beneficial to the community.

Jaime's sorting-based solution is the fastest of the approaches tested here, but with one caveat: the indices within each group are not necessarily ordered (because numpy's argsort uses quicksort by default), so the result differs from the OP's original algorithm (hereafter called naive). Using kind='stable' fixes this, but it slows things down a bit.

The naive method can be improved using Python's built-in array module like this:

import array
from collections import defaultdict

import numpy as np

a = np.array(...)  # 1D int array; its dtype should have the same item size as array.array("L") on this platform
d = defaultdict(lambda: array.array("L"))
alist = array.array("L")
alist.frombytes(a.tobytes())  # reinterpret the numpy buffer as a plain Python array of unsigned longs
for n in range(len(alist)):
    d[alist[n]].append(n)

It's only fractionally slower than Jaime's solution with stable sort.

Here's some testing done on my platform with Python 3

Best of 5
Naive method: 0.21274029999999988 s
Naive improved: 0.13265090000000002 s
Unique quick: 0.073496 s
Unique stable: 0.1235801999999997 s

The results from the naive method, naive improved, and unique stable are dictionaries with sorted lists of indices. Indices from unique quick are not sorted.

The benchmark code

import array
import timeit
from collections import defaultdict

import numpy as np

def count_naive(a):
    d = defaultdict(list)
    for n, e in enumerate(a):
        d[e].append(n)
    return dict(d)

def count_improved(a):
    d = defaultdict(lambda: array.array("L"))
    alist = array.array("L")
    alist.frombytes(a.tobytes())
    for n in range(len(alist)):
        d[alist[n]].append(n)
    return {n: indices.tolist() for n, indices in d.items()}

def count_unique(a):
    sorted_idx = np.argsort(a)  # default quicksort; see count_stable for the kind='stable' variant
    counts = np.bincount(a)     # assumes non-negative integers
    split_idx = np.split(sorted_idx, np.cumsum(counts[:-1]))
    return {n: indices.tolist() for n, indices in enumerate(split_idx)}

def count_stable(a):
    sorted_idx = np.argsort(a, kind="stable")
    counts = np.bincount(a)
    split_idx = np.split(sorted_idx, np.cumsum(counts[:-1]))
    return {n: indices.tolist() for n, indices in enumerate(split_idx)}

a = np.random.randint(1000, size=1000000)

trials = 5
t_naive = timeit.repeat("count_naive(a)", globals=globals(), repeat=trials, number=1)
t_improved = timeit.repeat("count_improved(a)", globals=globals(), repeat=trials, number=1)
t_unique = timeit.repeat("count_unique(a)", globals=globals(), repeat=trials, number=1)
t_stable = timeit.repeat("count_stable(a)", globals=globals(), repeat=trials, number=1)

print(f"Best of {trials}")
print(f"Naive method: {min(t_naive)} s")
print(f"Naive improved: {min(t_improved)} s")
print(f"Unique quick: {min(t_unique)} s")
print(f"Unique stable: {min(t_stable)} s")

N.B. All functions are written so that they return Dict[int, list], allowing the results to be compared directly.

– tboschi