
Say I have the following numpy array:

a = np.arange(20)

And also an array containing indices as follows:

ix = np.array([4,10,15])

I've been trying to come up with a vectorized solution to the following question: how can I apply a function to each chunk of a when it is split at the indices in ix?

So say I were to split a with np.split (I'm only using np.split here to illustrate the groups to which I would like to apply a function):

np.split(a,ix)

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

And say, for instance, I'd like to take the sum of each chunk, giving:

[6, 39, 60, 85]

How could I vectorize this using numpy?

yatu
  • I don't think you can vectorize your ops on that kind of array. Can you use pandas? – Dani Mesejo Feb 05 '19 at 18:23
  • Well yes, I could, but I'm looking for a numpy solution for better performance – yatu Feb 05 '19 at 18:24
  • You are looking for sum only or for a general solution? – Dani Mesejo Feb 05 '19 at 18:26
  • General. Was just an example – yatu Feb 05 '19 at 18:26
  • A solution with sum is good enough though, the idea should be the same – yatu Feb 05 '19 at 18:29
  • In general I believe that all numpy solutions will involve filling the holes in the array; you can fill the holes with 0 for sum, but that will not help for multiplication, for instance. – Dani Mesejo Feb 05 '19 at 18:31
  • My specific problem consists of finding the minimum and maximum of each chunk. So if sum works that should too (see the reduceat sketch after these comments) – yatu Feb 05 '19 at 18:33
  • I have an inkling from an answer I saw and used from Divakar a while back; unfortunately that means I have little memory of how it works till I get back home to see the code in a couple of hours. – roganjosh Feb 05 '19 at 18:45
  • Take a look at this solution: https://stackoverflow.com/a/32043366/4001592 – Dani Mesejo Feb 05 '19 at 18:45
  • It wasn't that one in the case I'm thinking of, but that guy has some serious tricks up his sleeve, so it's a decent bet he has solved something similar – roganjosh Feb 05 '19 at 18:47
  • Yes! That could indeed be a first step to get a proper ndarray. Thanks @DanielMesejo – yatu Feb 05 '19 at 19:13
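
For the sum/min/max cases discussed in these comments, one fully vectorized option is np.ufunc.reduceat, which reduces between consecutive boundary indices. A minimal sketch (not from the thread itself; the prepended 0 is needed because reduceat treats each index as the start of a segment):

import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])

# Prepend 0 so the first segment starts at the beginning of a
bounds = np.concatenate(([0], ix))

print(np.add.reduceat(a, bounds))      # [ 6 39 60 85]
print(np.minimum.reduceat(a, bounds))  # [ 0  4 10 15]
print(np.maximum.reduceat(a, bounds))  # [ 3  9 14 19]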

3 Answers


I do not know if this is the best solution, but you could convert the list of arrays of different sizes into a list of fixed-size arrays by padding with zeros, and then apply a function like sum that is not affected by the added zeros.

See an example below.

import numpy as np

a = np.arange(20)
ix = np.array([4,10,15])
b = np.split(a,ix)
print(b)

results in

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

Then use itertools to convert the list to a zero-padded array (technique from here):

import itertools
c = np.array(list(itertools.zip_longest(*b, fillvalue=0))).T
print(c)

which results in

[[ 0  1  2  3  0  0]
 [ 4  5  6  7  8  9]
 [10 11 12 13 14  0]
 [15 16 17 18 19  0]]

then sum it using

np.sum(c, axis=1)

results in

array([ 6, 39, 60, 85])
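
Zero padding only suits reductions where 0 is neutral, as noted in the comments. A small variant of the same zip_longest trick (a sketch, not part of the original answer) pads with NaN instead, so the nan-aware reductions also cover min and max:

import itertools
import numpy as np

a = np.arange(20)
b = np.split(a, np.array([4, 10, 15]))

# NaN padding forces a float array, but the filler is ignored by the nan* functions
c = np.array(list(itertools.zip_longest(*b, fillvalue=np.nan))).T

print(np.nansum(c, axis=1))  # [ 6. 39. 60. 85.]
print(np.nanmin(c, axis=1))  # [ 0.  4. 10. 15.]
print(np.nanmax(c, axis=1))  # [ 3.  9. 14. 19.]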
plasmon360
  • Pretty nice solution even though it's not numpy based, thanks for sharing! – yatu Feb 05 '19 at 18:35
  • Don't dismiss `zip_longest` just because it doesn't have a `np.` prefix! It's one of the nicer ways of padding lists to a uniform length. – hpaulj Feb 05 '19 at 19:49
  • Yes I think this solution is indeed nice! I'm only stressing that fact because I did explicitly mention that I was looking for a numpy solution – yatu Feb 05 '19 at 20:10
  • But why do you want a numpy solution? – hpaulj Feb 05 '19 at 21:18
  • Because I would be applying the solution to n arrays like a and ix. So ideally I would like to stack them together and perform vectorized operations over them. I start off with a 3-dim array, perform some operations and end up with these 2-dim arrays a and ix. I made it a 1-dim array here for simplicity, but it should be the same if the solution is numpy based. If Python loops are used it won't scale as well. – yatu Feb 06 '19 at 07:12

split produces a list of arrays, which may differ in length. It actually does so iteratively:

In [12]: alist = []
In [13]: alist.append(a[0:ix[0]])
In [14]: alist.append(a[ix[0]:ix[1]])
In [15]: alist.append(a[ix[1]:ix[2]])
....

Applying sum to each element of the list individually makes sense:

In [11]: [np.sum(row) for row in alist]
Out[11]: [6, 39, 60, 85]

When you have a list of arrays that differ in shape, it's a good bet that you'll have to do a Python level iteration on it.

Fast 'vectorization' means performing the calculations in compiled code. Most of that is built around multidimensional arrays, e.g. 2d ones. If your split had produced equal-size arrays, you could use np.sum with the appropriate axis parameter.

In [23]: a1 = a.reshape(4,5)
In [24]: a1
Out[24]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
In [25]: np.sum(a1, axis=1)
Out[25]: array([10, 35, 60, 85])

Sometimes we can play tricks to cast the problem into an n-d one, for example by padding the shorter chunks of the split with 0s. But that casting itself might require iteration.

As raised here (and its links), Origin of AttributeError: object has no attribute 'cos', math (ufunc) functions applied to object dtype arrays end up delegating the action to the corresponding methods of the objects. But that still involves a (near) Python-level iteration over the objects.
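
A quick illustration of that delegation (a sketch; the object array is built explicitly so numpy doesn't try to broadcast the ragged chunks):

import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])

# Object array whose elements are the unequal-length chunks
obj = np.empty(len(ix) + 1, dtype=object)
obj[:] = np.split(a, ix)

# Arithmetic delegates to each chunk's own __mul__, one Python-level call per element
print(obj * 2)

# np.cos(obj) would raise AttributeError: ndarray has no 'cos' method to delegate to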


Some timings:

In [57]: timeit [np.sum(row) for row in alist]
31.7 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [58]: timeit np.sum(list(itertools.zip_longest(*alist, fillvalue=0)),axis=0)
25.2 µs ± 82 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [59]: timeit np.nansum(pd.DataFrame(alist), axis=1)
908 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [61]: timeit np.frompyfunc(sum,1,1)(alist)
12.9 µs ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this last case the Python sum is faster than np.sum. But that's true with the list comprehension as well:

In [63]: timeit [sum(row) for row in alist]
6.86 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And with Divakar's wiz-bang fillna from Numpy: Fix array with rows of different lengths by filling the empty elements with zeros:

In [70]: timeit numpy_fillna(np.array(alist)).sum(axis=1)
44.2 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
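
numpy_fillna itself isn't reproduced here; a rough sketch of the masking idea behind it (a reconstruction based on the linked answer's description, not necessarily the exact code):

import numpy as np

def numpy_fillna(data):
    # Length of each ragged row
    lens = np.array([len(row) for row in data])
    # mask[i, j] is True while row i still has a j-th element
    mask = lens[:, None] > np.arange(lens.max())
    # Zero-filled rectangular output; scatter the ragged data into the valid slots
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(data)
    return out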

Once you have a multidimensional array, the numpy code is fast. But if you start with a list, even a list of arrays, Python list methods are often faster. The time it takes to construct an array (or DataFrame) is never trivial.

hpaulj
  • Yes, I'm aware that each array may differ in length. np.split here was just to make the example easy to understand. Just wondering if there was an alternative to Python-level iteration – yatu Feb 05 '19 at 18:32
  • The core of true vectorization is to perform the action in compiled code, using the standard numpy building blocks. In most cases that requires rectangular multidimensional arrays. – hpaulj Feb 05 '19 at 18:41

A pandas solution will be:

import numpy as np
import pandas as pd

a = np.arange(20)

ix = np.array([4, 10, 15])

data = pd.DataFrame(np.split(a, ix))

print(np.nansum(data, axis=1))

Output

[ 6. 39. 60. 85.]
Dani Mesejo
  • This works because `DataFrame` pads the short rows with `NaN`. But creating the frame is a very slow process. My list comprehension is 30x faster. – hpaulj Feb 05 '19 at 19:47