Is there a way to apply `np.bincount` with `axis=1`? The desired result would be the same as the list comprehension:

import numpy as np
A = np.array([[1, 0], [0, 0]])
np.array([np.bincount(r, minlength=np.max(A) + 1) for r in A])

# array([[1, 1],
#        [2, 0]])
maxymoo
  • `bincount` is compiled (for speed) and requires a 1d array. So your expression looks good. As you seem to realize, reassembling the result into an array requires a consistent number of bins. That issue may be why `bincount` is 1d - its application to the rows of a general 2d array will produce a ragged list. – hpaulj Jan 13 '16 at 00:19
  • I guess that makes sense, but don't you think that my solution below is kind of a weird hack? It's a pretty common situation in machine learning to have to compute row-wise counts in this way. – maxymoo Jan 13 '16 at 00:54
  • A 2013 question: Can numpy bincount work with 2D arrays? http://stackoverflow.com/questions/19201972/can-numpy-bincount-work-with-2d-arrays – hpaulj Jan 13 '16 at 01:30
  • Ah, thanks for that, I didn't know about `apply_along_axis`. Unfortunately it doesn't seem to work well for large matrices; on my data the accepted answer takes 10.1s on my example below. I guess the apply isn't very optimized... – maxymoo Jan 13 '16 at 01:37
  • If the values in each row are unique (relative to the other rows), you could `bincount` the flattened array, and then separate the counts. You could make the values unique by adding suitably large offsets. – hpaulj Jan 13 '16 at 01:40
  • `apply_along_axis` isn't that magical. Look at its code; it's just a fancy form of looping. – hpaulj Jan 13 '16 at 01:40
  • Here's a [link](https://github.com/numpy/numpy/pull/4330) to a pull request that added the functionality you are asking about to NumPy. It never got merged, because it was felt it complicated the code base more than the feature was worth. – Jaime Jan 13 '16 at 05:13

3 Answers

`np.bincount` doesn't work with a 2D array along an axis. To get the desired result with a single vectorized call to `np.bincount`, we can create a 1D array of IDs such that elements from different rows get different IDs even when the element values are the same, which prevents elements from different rows from being binned together. Such an ID array can be built with linear indexing in mind, like so -

N = A.max() + 1                                  # number of bins per row
ids = A + (N * np.arange(A.shape[0]))[:, None]   # offset each row into its own bin range

Then feed the IDs to `np.bincount` and finally reshape back to 2D -

np.bincount(ids.ravel(), minlength=N * A.shape[0]).reshape(-1, N)
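
A quick check with the array from the question, reusing the names from the snippet above:

import numpy as np

A = np.array([[1, 0], [0, 0]])
N = A.max() + 1
ids = A + (N * np.arange(A.shape[0]))[:, None]   # [[1, 0], [2, 2]]

np.bincount(ids.ravel(), minlength=N * A.shape[0]).reshape(-1, N)
# array([[1, 1],
#        [2, 0]])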
Divakar

If the data is too large for this to be efficient, the issue is more likely the memory usage of the dense matrix than the numerical operations themselves. Here is an example of using sklearn's HashingVectorizer on a matrix that is too large to use the bincount method (the result is a sparse matrix):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

h = HashingVectorizer()
A = np.random.randint(100, size=(1000, 100)) * 10000
A_str = [" ".join(str(v) for v in row) for row in A]

%timeit h.fit_transform(A_str)
# 10 loops, best of 3: 110 ms per loop
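
The hashed features are not exact counts, though. If exact per-row counts are needed but the dense output is the problem, the count matrix can also be built sparsely. A minimal sketch, assuming SciPy is available (`csr_matrix` sums duplicate (row, value) pairs given COO-style input, which yields per-row counts):

import numpy as np
from scipy import sparse

A = np.random.randint(100, size=(1000, 100)) * 10000
n_bins = A.max() + 1

# pair each element's row index with its value as the column index
rows = np.repeat(np.arange(A.shape[0]), A.shape[1])
counts = sparse.csr_matrix(
    (np.ones(A.size, dtype=np.int64), (rows, A.ravel())),
    shape=(A.shape[0], n_bins),
)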
maxymoo

You can use `apply_along_axis`. Here is an example:

import numpy as np

test_array = np.array([[0, 0, 1], [0, 0, 1]])
print(test_array)
np.apply_along_axis(np.bincount, axis=1, arr=test_array,
                    minlength=np.max(test_array) + 1)
# array([[2, 1],
#        [2, 1]])

Note that the final shape of the output depends on the number of bins. You can also pass other arguments to `np.bincount` through `apply_along_axis`, as shown with `minlength` above.
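
Applied to the array from the question, this reproduces the list-comprehension result:

A = np.array([[1, 0], [0, 0]])
np.apply_along_axis(np.bincount, axis=1, arr=A, minlength=np.max(A) + 1)
# array([[1, 1],
#        [2, 0]])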

sushmit