23

I am seeing behaviour with numpy bincount that I cannot make sense of. I want to bin the values in a 2D array in a row-wise manner and see the behaviour below. Why would it work with dbArray but fail with simarray?

>>> dbArray
array([[1, 0, 1, 0, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 1, 0, 1, 0]])
>>> N.apply_along_axis(N.bincount,1,dbArray)
array([[2, 3],
       [0, 5],
       [1, 4],
       [4, 1],
       [3, 2],
       [3, 2]], dtype=int64)
>>> simarray
array([[2, 0, 2, 0, 2],
       [2, 1, 2, 1, 2],
       [2, 1, 1, 1, 2],
       [2, 0, 1, 0, 1],
       [1, 0, 1, 1, 2],
       [1, 1, 1, 1, 1]])
>>> N.apply_along_axis(N.bincount,1,simarray)

Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    N.apply_along_axis(N.bincount,1,simarray)
  File "C:\Python27\lib\site-packages\numpy\lib\shape_base.py", line 118, in apply_along_axis
    outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (3)
James
  • 387
  • 1
  • 3
  • 8

3 Answers3

24

The problem is that bincount isn't always returning the same shaped objects, in particular when values are missing. For example:

>>> m = np.array([[0,0,1],[1,1,0],[1,1,1]])
>>> np.apply_along_axis(np.bincount, 1, m)
array([[2, 1],
       [1, 2],
       [0, 3]])
>>> [np.bincount(m[i]) for i in range(m.shape[1])]
[array([2, 1]), array([1, 2]), array([0, 3])]

works, but:

>>> m = np.array([[0,0,0],[1,1,0],[1,1,0]])
>>> m
array([[0, 0, 0],
       [1, 1, 0],
       [1, 1, 0]])
>>> [np.bincount(m[i]) for i in range(m.shape[1])]
[array([3]), array([1, 2]), array([1, 2])]
>>> np.apply_along_axis(np.bincount, 1, m)
Traceback (most recent call last):
  File "<ipython-input-49-72e06e26a718>", line 1, in <module>
    np.apply_along_axis(np.bincount, 1, m)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/shape_base.py", line 117, in apply_along_axis
    outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (1)

won't.

You could use the minlength parameter and pass it using a lambda or partial or something:

>>> np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=1, arr=m)
array([[3, 0],
       [1, 2],
       [1, 2]])
fabian789
  • 8,348
  • 4
  • 45
  • 91
DSM
  • 342,061
  • 65
  • 592
  • 494
  • 1
    Is there a way to make this work when the `weights` argument for `np.bincount` is provided? So far, I've had no luck: https://stackoverflow.com/questions/55603803/can-i-parallelize-numpy-bincount-using-xarray-apply-ufunc?noredirect=1#comment97903211_55603803 – takachanbo Apr 10 '19 at 16:29
  • you may like this answer of mine on how to do this in one line with pandas. https://stackoverflow.com/a/50233144/3219573 – bonobo Mar 18 '21 at 13:59
4

As @DSM has already mentioned, bincount of a 2d array cannot be done without knowing the maximum value of the array, because it would mean an inconsistency of array sizes.

But thanks to the power of numpy's indexing, it was fairly easy to make a faster implementation of 2d bincount, as it doesn't use concatenation or anything.

def bincount2d(arr, bins=None):
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[len(arr), bins], dtype=np.int64)
    indexing = (np.ones_like(arr).T * np.arange(len(arr))).T
    np.add.at(count, (indexing, arr), 1)

    return count

UPD: Here is the 3D version of it. Just dug it up from some old code of mine:

def bincount3d(arr, bins=None):
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[arr.shape[0], arr.shape[1], bins], dtype=np.int64)
    index2d = np.ones_like(arr) * np.reshape(np.arange(arr.shape[1]), newshape=[1, arr.shape[1], 1])
    index3d = np.ones_like(arr) * np.reshape(np.arange(arr.shape[0]), newshape=[arr.shape[0], 1, 1])
    np.add.at(count, (index3d, index2d, arr), 1)

    return count
winwin
  • 958
  • 7
  • 25
1

This is a function that does exactly what you want, but without any loops.

def sub_sum_partition(a, partition):
    """
    Generalization of np.bincount(partition, a).
    Sums rows of a matrix for each value of array of non-negative ints.

    :param a: array_like
    :param partition: array_like, 1 dimension, nonnegative ints
    :return: matrix of shape ('one larger than the largest value in partition', a.shape[1:]). The i's element is
    the sum of rows j in 'a' s.t. partition[j] == i
    """
    assert partition.shape == (len(a),)
    n = np.prod(a.shape[1:], dtype=int)
    bins = ((np.tile(partition, (n, 1)) * n).T + np.arange(n, dtype=int)).reshape(-1)
    sums = np.bincount(bins, a.reshape(-1))
    if n > 1:
        sums = sums.reshape(-1, *a.shape[1:])
    return sums