>>> import numpy_indexed as npi
>>> import numpy as np
>>> a = np.array([[0,0,1,1,2,2], [4,4,8,8,10,10]]).T
>>> a
array([[ 0,  4],
       [ 0,  4],
       [ 1,  8],
       [ 1,  8],
       [ 2, 10],
       [ 2, 10]])
>>> npi.group_by(a[:, 0]).sum(a[:,1])
(array([0, 1, 2]), array([ 8, 16, 20], dtype=int32))

I want to perform calculations on subsets of the second column, grouped by the first column, on large arrays (~1M rows). Is there an efficient (and/or vectorised) way to use the output of `group_by` from `numpy_indexed` in order to add a new column holding the result of these calculations? For the sum example above I would like to produce the output below.

If there is an efficient way of doing this without using numpy_indexed in the first place, that would also be very helpful.

array([[ 0,  4,  8],
       [ 0,  4,  8],
       [ 1,  8, 16],
       [ 1,  8, 16],
       [ 2, 10, 20],
       [ 2, 10, 20]])
Tony
  • Would the first col be already sorted? Also, would it always start with `0` and have only sequential numbers? – Divakar May 18 '17 at 10:29
  • The column with the categories is always sorted; the group numbers don't necessarily start from `0` but _are_ incrementally numbered. Thank you – Tony May 18 '17 at 10:37

2 Answers


One approach would be to use np.unique to generate the unique tags and the interval-shifting indices, and then np.add.reduceat for the intervaled summing -

# Group-start indices (idx) and per-row group IDs (tags) from the first column
_,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)
# Sum each interval, then map the per-group sums back to the rows
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]

Another way that avoids np.unique and might be beneficial for performance would be like so -

# Group-start indices: 0 plus every position where the label increases
idx = np.r_[0,np.flatnonzero(a[1:,0] > a[:-1,0])+1]
# Per-row group IDs: mark each group start (except the first), then cumsum
tag_arr = np.zeros(a.shape[0], dtype=int)
tag_arr[idx[1:]] = 1
tags = tag_arr.cumsum()
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]

For a further performance boost, we could use np.bincount. Thus, np.add.reduceat(a[:,1],idx) could be replaced by np.bincount(tags, a[:,1]).
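
A rough sketch of that replacement, in terms of the tags computed above (note that np.bincount returns a float array, so the new column is cast back to the input dtype here) -

group_sums = np.bincount(tags, a[:,1])            # same values as np.add.reduceat(a[:,1], idx)
out = np.c_[a, group_sums[tags].astype(a.dtype)]  # broadcast back to rows and append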

Sample run -

In [271]: a    # Using a more generic sample
Out[271]: 
array([[11,  4],
       [11,  4],
       [14,  8],
       [14,  8],
       [16, 10],
       [16, 10]])

In [272]: _,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)

In [273]: np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
Out[273]: 
array([[11,  4,  8],
       [11,  4,  8],
       [14,  8, 16],
       [14,  8, 16],
       [16, 10, 20],
       [16, 10, 20]])

Now, the listed approaches assume that the first column is already sorted. If that's not the case, we need to sort the array by the first column with argsort and then use the proposed method. Thus, for the unsorted case, we need the following as pre-processing -

a = a[a[:,0].argsort()]

Battle against np.unique

Let's time the custom flatnonzero + cumsum based method against the built-in np.unique for creating the shifting indices idx and the uniqueness-based IDs/tags tags. For a case like this one, where we know beforehand that the labels column is already sorted, the custom method avoids any sorting, unlike np.unique, which sorts internally. This gives us an advantage on performance. So, let's verify it.

Approaches -

def nonzero_cumsum_based(A):
    idx = np.concatenate(( [0] ,np.flatnonzero(A[1:] > A[:-1])+1 ))
    tags = np.zeros(len(A), dtype=int)
    tags[idx[1:]] = 1
    np.cumsum(tags, out = tags)
    return idx, tags

def unique_based(A):
    _,idx,tags = np.unique(A, return_index=1, return_inverse=1)
    return idx, tags

Sample run with the custom func -

In [438]: a
Out[438]: 
array([[11,  4],
       [11,  4],
       [14,  8],
       [14,  8],
       [16, 10],
       [16, 10]])

In [439]: idx, tags = nonzero_cumsum_based(a[:,0])

In [440]: idx
Out[440]: array([0, 2, 4])

In [441]: tags
Out[441]: array([0, 0, 1, 1, 2, 2])

Timings -

In [444]: a = np.c_[np.sort(randi(10,10000,(100000))), randi(0,10000,(100000))]  # randi = np.random.randint

In [445]: %timeit unique_based(a[:,0])
100 loops, best of 3: 4.3 ms per loop

In [446]: %timeit nonzero_cumsum_based(a[:,0])
1000 loops, best of 3: 486 µs per loop

In [447]: a = np.c_[np.sort(randi(10,10000,(1000000))), randi(0,10000,(1000000))]

In [448]: %timeit unique_based(a[:,0])
10 loops, best of 3: 50.2 ms per loop

In [449]: %timeit nonzero_cumsum_based(a[:,0])
100 loops, best of 3: 3.98 ms per loop
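
Putting the faster pieces together, a minimal end-to-end sketch for the already-sorted case (assuming the helper nonzero_cumsum_based from above is defined) would be -

idx, tags = nonzero_cumsum_based(a[:,0])
out = np.c_[a, np.bincount(tags, a[:,1])[tags].astype(a.dtype)]
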
Divakar
  • Thank you, may I ask how the first value that is unpacked, `_`, is used? – Tony May 18 '17 at 10:43
  • @Tony We have three outputs from that step. With that `_`, we are just not storing it, i.e. we don't need that first output. – Divakar May 18 '17 at 10:44

Every index object has an inverse property, which maps reduced values back to their original range; for illustration, we could write:

index = npi.as_index(keys)
unique_keys = index.unique
unique_keys[index.inverse] == keys  # <- should be all true

And this property is exposed on the GroupBy object as well, since mapping grouped values back to their input range is indeed a commonly useful operation:

groups = npi.group_by(a[:, 0])
unique, sums = groups.sum(a[:, 1])
new_column = sums[groups.inverse]

In general, the source of numpy_indexed can be an inspiration for how to perform such common operations; group_by.var, for instance, faces the same problem of broadcasting the per-group means back to each element of the group they were computed from, in order to compute the errors within each group. But better tutorials certainly wouldn't hurt either.
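
For illustration, a rough sketch of that pattern on the example array (using the GroupBy mean reduction; this is a sketch, not the actual group_by.var source) -

groups = npi.group_by(a[:, 0])
unique, means = groups.mean(a[:, 1])        # per-group means of the second column
errors = a[:, 1] - means[groups.inverse]    # deviation of each element from its group mean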

Could you give an even higher level description of the problem you are trying to solve? Chances are you can simplify your code even more from a higher level, when you get more comfortable thinking in terms of the design patterns that npi makes convenient.

Eelco Hoogendoorn
  • Thank you for your answer. If I define `a` as in my initial post and then run the two lines you are presenting above, the first one returns an `npi` object, but the second returns `TypeError: only integer scalar arrays can be converted to a scalar index`. I'm sure the answer is obvious to you but not to me so any help please? Cheers – Tony May 18 '17 at 12:38
  • Sorry, small mistake on my part indeed; I still haven't actually run my code, but I am fairly sure it should be correct now! – Eelco Hoogendoorn May 18 '17 at 13:06
  • Eelco Hoogendoorn, `npi` works great and thanks for posting here. I have a 1-dimensional `np.array` for my groups and a 2-dimensional `array` for my xs, and I really like the fact that it picks up the second dimension and does it separately for each column when I do `_, means= groups.means(Xs_df); means[groups.inverse]` – Tony May 19 '17 at 16:55
  • I have a question though; because my set is quite large, consisting of 0/1 dummies (.5m x 4k), it takes quite a while to calculate the averages per group per column. Is it possible to utilise `npi` with `scipy`'s sparse matrices to make it run faster? – Tony May 19 '17 at 17:00
  • That sounds a bit more complex to fully explain than the space here permits; perhaps you could work out your question a bit more as an edit to the original question? – Eelco Hoogendoorn May 19 '17 at 18:52