Numpy: get the index of the elements of a 1d array as a 2d array

Question

I have a numpy array like this: [1 2 2 0 0 1 3 5]

Is it possible to get the index of the elements as a 2d array? For instance the answer for the above input would be [[3 4], [0 5], [1 2], [6], [], [7]]

Currently I have to loop the different values and call numpy.where(input == i) for each value, which has terrible performance with a big enough input.

`np.argsort([1, 2, 2, 0, 0, 1, 3, 5])` gives `array([3, 4, 0, 5, 1, 2, 6, 7], dtype=int64)`. then you can just compare next elements. — vb_rises, Oct 20 '19 at 03:01

Paul Panzer · Accepted Answer · 2020-02-05T07:17:37.367

Here is an O(max(x)+len(x)) approach using scipy.sparse:

import numpy as np
from scipy import sparse

x = np.array("1 2 2 0 0 1 3 5".split(),int)
x
# array([1, 2, 2, 0, 0, 1, 3, 5])


M,N = x.max()+1,x.size
sparse.csc_matrix((x,x,np.arange(N+1)),(M,N)).tolil().rows.tolist()
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]

This works by creating a sparse matrix with entries at positions (x[0],0), (x[1],1), ... Using the CSC (compressed sparse column) format this is rather simple. The matrix is then converted to LIL (linked list) format. This format stores the column indices for each row as a list in its rows attribute, so all we need to do is take that and convert it to list.

Note that for small arrays argsort based solutions are probably faster but at some not insanely large size this will cross over.

EDIT:

argsort-based numpy-only solution:

np.split(x.argsort(kind="stable"),np.bincount(x)[:-1].cumsum())
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]

If the order of indices within groups doesn't matter you can also try argpartition (it happens to make no difference in this small example but this is not guaranteed in general):

bb = np.bincount(x)[:-1].cumsum()
np.split(x.argpartition(bb),bb)
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]

EDIT:

@Divakar recommends against the use of np.split. Instead, a loop is probably faster:

A = x.argsort(kind="stable")
B = np.bincount(x+1).cumsum()
[A[B[i-1]:B[i]] for i in range(1,len(B))]

Or you could use the brand new (Python3.8+) walrus operator:

A = x.argsort(kind="stable")
B = np.bincount(x)
L = 0
[A[L:(L:=L+b)] for b in B.tolist()]

EDIT(EDITED):

(Not pure numpy): As an alternative to numba (see @senderle's post) we can also use pythran.

Compile with pythran -O3 <filename.py>

import numpy as np

#pythran export sort_to_bins(int[:],int)

def sort_to_bins(idx, mx):
    if mx==-1: 
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i] + 2] += 1
    for i in range(3, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i]+1]] = i
        cnts[idx[i]+1] += 1
    return [res[cnts[i]:cnts[i+1]] for i in range(mx)]

Here numba wins by a whisker performance-wise:

repeat(lambda:enum_bins_numba_buffer(x),number=10)
# [0.6235917090671137, 0.6071486569708213, 0.6096088469494134]
repeat(lambda:sort_to_bins(x,-1),number=10)
# [0.6235359431011602, 0.6264424560358748, 0.6217901279451326]

Older stuff:

import numpy as np

#pythran export bincollect(int[:])

def bincollect(a):
    o = [[] for _ in range(a.max()+1)]
    for i,j in enumerate(a):
        o[j].append(i)
    return o

Timings vs. numba (old)

timeit(lambda:bincollect(x),number=10)
# 3.5732191529823467
timeit(lambda:enumerate_bins(x),number=10)
# 6.7462647299980745

This ended up being a bit faster than @Randy's answer – Frederico Schardong Oct 20 '19 at 13:23 — Frederico Schardong, Oct 20 '19 at 13:23
A loop-based one should be better than `np.split`. – Divakar Feb 02 '20 at 19:30 — Divakar, Feb 02 '20 at 19:30

score 8 · Answer 2 · answered Oct 20 '19 at 03:07

One potential option depending on the size of your data is to just drop out of numpy and use collections.defaultdict:

In [248]: from collections import defaultdict

In [249]: d = defaultdict(list)

In [250]: l = np.random.randint(0, 100, 100000)

In [251]: %%timeit
     ...: for k, v in enumerate(l):
     ...:     d[v].append(k)
     ...:
10 loops, best of 3: 22.8 ms per loop

Then you end up with a dictionary of {value1: [index1, index2, ...], value2: [index3, index4, ...]}. The time scaling is pretty close to linear with the size of the array, so 10,000,000 takes ~2.7s on my machine, which seems reasonable enough.

senderle · Answer 3 · 2020-02-04T14:06:33.870

Although the request is for a numpy solution, I decided to see whether there is an interesting numba-based solution. And indeed there is! Here's an approach that represents the partitioned list as a ragged array stored in a single preallocated buffer. This takes some inspiration from the argsort approach proposed by Paul Panzer. (For an older version that didn't do as well, but was simpler, see below.)

@numba.jit(numba.void(numba.int64[:], 
                      numba.int64[:], 
                      numba.int64[:]), 
           nopython=True)
def enum_bins_numba_buffer_inner(ints, bins, starts):
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] += 1

@numba.jit(nopython=False)  # Not 100% sure this does anything...
def enum_bins_numba_buffer(ints):
    ends = np.bincount(ints).cumsum()
    starts = np.empty(ends.shape, dtype=np.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = np.empty(ints.shape, dtype=np.int64)
    enum_bins_numba_buffer_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]

This processes a ten-million item list in 75ms, which is nearly a 50x speedup from a list-based version written in pure Python.

For a slower but somewhat more readable version, here's what I had before, based on recently added experimental support for dynamically sized "typed lists," which allow us to fill up each bin in an out-of-order way much more quickly.

This wrestles with numba's type inference engine a bit, and I'm sure there's a better way to handle that part. This also turns out to be almost 10x slower than the above.

@numba.jit(nopython=True)
def enum_bins_numba(ints):
    bins = numba.typed.List()
    for i in range(ints.max() + 1):
        inner = numba.typed.List()
        inner.append(0)  # An awkward way of forcing type inference.
        inner.pop()
        bins.append(inner)

    for x, i in enumerate(ints):
        bins[i].append(x)

    return bins

I tested these against the following:

def enum_bins_dict(ints):
    enum_bins = defaultdict(list)
    for k, v in enumerate(ints):
        enum_bins[v].append(k)
    return enum_bins

def enum_bins_list(ints):
    enum_bins = [[] for i in range(ints.max() + 1)]
    for x, i in enumerate(ints):
        enum_bins[i].append(x)
    return enum_bins

def enum_bins_sparse(ints):
    M, N = ints.max() + 1, ints.size
    return sparse.csc_matrix((ints, ints, np.arange(N + 1)),
                             (M, N)).tolil().rows.tolist()

I also tested them against a precompiled cython version similar to enum_bins_numba_buffer (described in detail below).

On a list of ten million random ints (ints = np.random.randint(0, 100, 10000000)) I get the following results:

enum_bins_dict(ints)
3.71 s ± 80.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_list(ints)
3.28 s ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_sparse(ints)
1.02 s ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_numba(ints)
693 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_cython(ints)
82.3 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

enum_bins_numba_buffer(ints)
77.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Impressively, this way of working with numba outperforms a cython version of the same function, even with bounds-checking turned off. I don't yet have enough familiarity with pythran to test this approach using it, but I'd be interested to see a comparison. It seems likely based on this speedup that the pythran version might also be quite a bit faster with this approach.

Here's the cython version for reference, with some build instructions. Once you have cython installed, you'll need a simple setup.py file like this:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        'enum_bins_cython',
        ['enum_bins_cython.pyx'],
    )
]

setup(
    ext_modules=cythonize(ext_modules),
    include_dirs=[numpy.get_include()]
)

And the cython module, enum_bins_cython.pyx:

# cython: language_level=3

import cython
import numpy
cimport numpy

@cython.boundscheck(False)
@cython.cdivision(True)
@cython.wraparound(False)
cdef void enum_bins_inner(long[:] ints, long[:] bins, long[:] starts) nogil:
    cdef long i, x
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] = starts[i] + 1

def enum_bins_cython(ints):
    assert (ints >= 0).all()
    # There might be a way to avoid storing two offset arrays and
    # save memory, but `enum_bins_inner` modifies the input, and
    # having separate lists of starts and ends is convenient for
    # the final partition stage.
    ends = numpy.bincount(ints).cumsum()
    starts = numpy.empty(ends.shape, dtype=numpy.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = numpy.empty(ints.shape, dtype=numpy.int64)
    enum_bins_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]

With these two files in your working directory, run this command:

python setup.py build_ext --inplace

You can then import the function using from enum_bins_cython import enum_bins_cython.

I wonder if you are aware of pythran which in very broad terms is similar to numba. I added a pythran solution to my post. On this occasion pythran would appear to have the upper hand, delivering a faster and much more pythonic solution. — Paul Panzer, Feb 04 '20 at 07:06
@PaulPanzer interesting! I hadn't heard of it. I gather that the numba devs will be adding the expected syntactic sugar once the List code is stable. There also seems to be a convenience/speed trade-off here — the jit decorator is very easy to integrate into an ordinary Python code base compared to an approach requiring separate precompiled modules. But a 3x speedup over the scipy approach is indeed impressive, even surprising! — senderle, Feb 04 '20 at 07:56
Just remembered that I had basically done this before: https://stackoverflow.com/q/55226662/7207392. Would you mind adding your numba and cython versions to that Q&A? Only difference is: we don't bin indices 0,1,2,... but instead another array. And we don't bother actually chopping up the resultant array. — Paul Panzer, Feb 04 '20 at 14:54
@PaulPanzer ah very cool. I'll try to add it at some point today or tomorrow. Are you suggesting a separate answer or just an edit to your answer? Happy either way! — senderle, Feb 04 '20 at 16:14
Great! I think a separate post would be better but no strong preference. — Paul Panzer, Feb 04 '20 at 16:42

Daniel F · Answer 4 · 2020-02-04T11:39:42.503

Here's a really really weird way to do this that's terrible, but I found it too funny to not share - and all numpy!

out = np.array([''] * (x.max() + 1), dtype = object)
np.add.at(out, x, ["{} ".format(i) for i in range(x.size)])
[[int(i) for i in o.split()] for o in out]

Out[]:
[[3, 4], [0, 5], [1, 2], [6], [], [7]]

EDIT: this is the best method I could find along this path. It's still 10x slower than @PaulPanzer 's argsort solution:

out = np.empty((x.max() + 1), dtype = object)
out[:] = [[]] * (x.max() + 1)
coords = np.empty(x.size, dtype = object)
coords[:] = [[i] for i in range(x.size)]
np.add.at(out, x, coords)
list(out)

score 2 · Answer 5 · answered Feb 02 '20 at 16:30

You can do it by making a dictionary of numbers, keys would be the numbers and values should be the indices that number seen, this is one of the fastest ways to do it, you can see the code bellow :

>>> import numpy as np
>>> a = np.array([1 ,2 ,2 ,0 ,0 ,1 ,3, 5])
>>> b = {}
# Creating an empty list for the numbers that exist in array a
>>> for i in range(np.min(a),np.max(a)+1):
    b[str(i)] = []

# Adding indices to the corresponding key
>>> for i in range(len(a)):
    b[str(a[i])].append(i)

# Resulting Dictionary
>>> b
{'0': [3, 4], '1': [0, 5], '2': [1, 2], '3': [6], '4': [], '5': [7]}

# Printing the result in the way you wanted.
>>> for i in sorted (b.keys()) :
     print(b[i], end = " ")

[3, 4] [0, 5] [1, 2] [6] [] [7]

score 1 · Answer 6 · answered Oct 20 '19 at 03:24

Pseudocode:

get the "number of 1d arrays in the 2d array", by subtracting the minimum value of your numpy array from the maximum value and then plus one. In your case, it will be 5-0+1 = 6
initialize a 2d array with the number of 1d arrays within it. In your case, initialize a 2d array with 6 1d array in it. Each 1d array corresponds to a unique element in your numpy array, for example, the first 1d array will correspond to '0', the second 1d array will correspond to '1',...
loop through your numpy array, put the index of the element into the right corresponding 1d array. In your case, the index of the first element in your numpy array will be put to the second 1d array, the index of the second element in your numpy array will be put to the third 1d array, ....

This pseudocode will take linear time to run as it depends on the length of your numpy array.

score 1 · Answer 7 · answered Feb 06 '20 at 22:23

This gives you exactly what you want and would take about 2.5 seconds for 10,000,000 on my machine:

import numpy as np
import timeit

# x = np.array("1 2 2 0 0 1 3 5".split(),int)
x = np.random.randint(0, 100, 100000)

def create_index_list(x):
    d = {}
    max_value = -1
    for i,v in enumerate(x):
        if v > max_value:
            max_value = v
        try:
            d[v].append(i)
        except:
            d[v] = [i]
    result_list = []
    for i in range(max_value+1):
        if i in d:
            result_list.append(d[i])
        else:
            result_list.append([])
    return result_list

# print(create_index_list(x))
print(timeit.timeit(stmt='create_index_list(x)', number=1, globals=globals()))

Ramsha Siddiqui · Answer 8 · 2020-02-09T09:24:28.967

0

So given a list of elements, you want to make (element, index) pairs. In linear time, this could be done as:

hashtable = dict()
for idx, val in enumerate(mylist):
    if val not in hashtable.keys():
         hashtable[val] = list()
    hashtable[val].append(idx)
newlist = sorted(hashtable.values())

This should take O(n) time. I can not think of a faster solution as of now, but will update here if I do.

edited Feb 09 '20 at 09:24

answered Feb 08 '20 at 19:51

Ramsha Siddiqui

460
6
20

Numpy: get the index of the elements of a 1d array as a 2d array

8 Answers8