split integer list into bins, retain indices

Question

I have a large NumPy integer array with a distinct set of values, e.g.,

[0, 1, 0, 0, 0, 2, 2]

From this, I would like to get each distinct value together with the indices at which it occurs. The following works, but the explicit comparison == appears less than optimal to me.

import numpy as np

arr = np.array([0, 1, 0, 0, 0, 2, 2])
vals = np.unique(arr)

d = {val: np.where(arr == val)[0] for val in vals}

print(d)

{0: array([0, 2, 3, 4]), 1: array([1]), 2: array([5, 6])}

Any better ideas?

Andrej Kesely · Answer 1 · 2022-04-07T23:56:33.633

Another solution:

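import numpy as np

arr = np.array([0, 1, 0, 0, 0, 2, 2])

# Indices that sort arr; equal values end up in one contiguous run.
a = arr.argsort()
# Unique values and the length of each run.
v, cnt = np.unique(arr, return_counts=True)
# Cut the sorted indices at the run boundaries and pair each piece with its value.
x = dict(zip(v, np.split(a, cnt.cumsum()[:-1])))
print(x)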

Prints:

{0: array([0, 2, 3, 4]), 1: array([1]), 2: array([5, 6])}

Whether this is faster than the comparison-based approach depends on your data: the size of the array and the number of unique values in it.
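To see why the split lands on the right boundaries, here is a small sketch that prints the intermediate arrays. The names `order`, `values`, `counts`, `boundaries` and `bins` are mine, and `kind="stable"` is an addition that is not in the code above: the default sort does not promise to keep equal values in their original order, so a stable sort is what guarantees ascending indices inside each bin.

```python
import numpy as np

arr = np.array([0, 1, 0, 0, 0, 2, 2])

# Stable sort: positions of equal values keep their original relative order.
order = arr.argsort(kind="stable")
print(order)       # [0 2 3 4 1 5 6]

# Sorted unique values and how many times each one occurs.
values, counts = np.unique(arr, return_counts=True)
print(counts)      # [4 1 2]

# The running total gives the end of each run; the last one is the array
# length, which np.split does not need.
boundaries = counts.cumsum()[:-1]
print(boundaries)  # [4 5]

# Cut the sorted positions at the boundaries and label each piece.
bins = dict(zip(values.tolist(), np.split(order, boundaries)))
print(bins)
```

Using `values.tolist()` gives plain Python ints as dictionary keys, which is optional but makes the printed result easier to read.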

A benchmark (Ubuntu 20.04 on AMD 3700x, Python 3.9.7, numpy==1.21.5):

import numpy as np
import perfplot

NUM_UNIQUE_VALUES = 10


def make_data(n):
    return np.random.randint(0, NUM_UNIQUE_VALUES, n)


def k1(arr):
    vals = np.unique(arr)
    return {val: np.where(arr == val)[0] for val in vals}


def k2(arr):
    a = arr.argsort()
    v, cnt = np.unique(arr, return_counts=True)
    return dict(zip(v, np.split(a, cnt.cumsum()[:-1])))


perfplot.show(
    setup=make_data,
    kernels=[k1, k2],
    labels=["Nico", "Andrej"],
    equality_check=None,
    n_range=[2 ** k for k in range(1, 25)],
    xlabel="2**N",
    logx=True,
    logy=True,
)
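Since equality_check=None turns off perfplot's comparison of the kernel outputs, it may be worth checking by hand that both kernels produce the same bins. A rough sketch, assuming the two kernels as defined above; the seed, array size and the helper name `same_bins` are arbitrary choices of mine. Each group is sorted before comparing because the default argsort is not guaranteed to keep equal values in their original order.

```python
import numpy as np


def k1(arr):
    vals = np.unique(arr)
    return {val: np.where(arr == val)[0] for val in vals}


def k2(arr):
    a = arr.argsort()
    v, cnt = np.unique(arr, return_counts=True)
    return dict(zip(v, np.split(a, cnt.cumsum()[:-1])))


def same_bins(d1, d2):
    # Same keys, and for every key the same set of indices.
    if d1.keys() != d2.keys():
        return False
    return all(np.array_equal(np.sort(d1[k]), np.sort(d2[k])) for k in d1)


rng = np.random.default_rng(0)
data = rng.integers(0, 10, 1000)
print(same_bins(k1(data), k2(data)))
```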

With NUM_UNIQUE_VALUES = 10: [benchmark plot not included]

With NUM_UNIQUE_VALUES = 1024: [benchmark plot not included]

Getting bins from array of 1 million elements (changing only number of unique values):

def make_data(n):
    return np.random.randint(0, n, 1_000_000)

score 0 · Answer 2 · answered Apr 07 '22 at 22:59

Here's an alternative, but I don't think this is any better. This creates an "index" array, inserts that as a second column, and sorts the rows. You'll see the final result has the values in order, with their original indices in the second column.

>>> arr = np.array([0,1,0,0,0,2,2])
>>> ndx = np.arange(arr.shape[0])
>>> ndx
array([0, 1, 2, 3, 4, 5, 6])
>>> both = np.vstack((arr,ndx)).T
>>> both
array([[0, 0],
       [1, 1],
       [0, 2],
       [0, 3],
       [0, 4],
       [2, 5],
       [2, 6]])
>>> both[both[:,0].argsort()]
array([[0, 0],
       [0, 2],
       [0, 3],
       [0, 4],
       [1, 1],
       [2, 5],
       [2, 6]])
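The sorted two-column array is not yet the dictionary the question asks for. One possible way to finish, sketched under my own naming (`sorted_rows`, `starts`, `bins`) and with `kind="stable"` added so the indices within each value stay ascending: find where each value's run begins in the first column and split the second column there.

```python
import numpy as np

arr = np.array([0, 1, 0, 0, 0, 2, 2])
ndx = np.arange(arr.shape[0])
both = np.vstack((arr, ndx)).T

# Sort rows by value; the stable sort keeps original indices ascending per value.
sorted_rows = both[both[:, 0].argsort(kind="stable")]

# return_index gives the first row of each value's run in the sorted column.
values, starts = np.unique(sorted_rows[:, 0], return_index=True)

# Split the index column at the start of every run except the first.
groups = np.split(sorted_rows[:, 1], starts[1:])
bins = dict(zip(values.tolist(), groups))
print(bins)
```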
