
I need to find the duplicate numbers in each row of a two-dimensional array, together with the number of repetitions of each. np.unique handles this well for one-dimensional arrays, but it does not seem to apply to two-dimensional ones. I have searched for similar answers, but I need a more detailed report (the number of occurrences of every number, plus the position index).
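For a single row, np.unique with return_counts=True already gives this kind of report (a minimal illustration, not from the original question):

import numpy as np

row = np.array([1, 2, 2, 2, 3])
values, counts = np.unique(row, return_counts=True)
# values -> array([1, 2, 3]), counts -> array([1, 3, 1])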

Can numpy bincount work with 2D arrays? That answer does not quite match what I need: I would like a result that carries more information about the data, such as which number occurs the most. I also dislike loops; avoiding them entirely may be unrealistic, but I will keep looking for loop-free approaches, because my speed requirements are very strict.

For example:

a = np.array([[1,2,2,2,3],
              [0,1,1,1,2],
              [0,0,0,1,0]])

# The number of occurrences of each number in the first row:
# value  count
#   0      0
#   1      1
#   2      3
#   3      1

# Desired output: one row of counts per input row;
# the column index is the value, the entry is its count:
[[0 1 3 1]  
 [1 3 1 0]
 [4 1 0 0]]

Because this runs inside a larger loop, I need an efficient, vectorized way to compute the statistics for many rows at once, again avoiding explicit loops.
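For reference, the desired table can be built with a single np.bincount call by offsetting each row's values so that different rows land in disjoint bin ranges. This is a standard trick rather than anything from the question, and it assumes the values are non-negative integers smaller than m:

import numpy as np

a = np.array([[1, 2, 2, 2, 3],
              [0, 1, 1, 1, 2],
              [0, 0, 0, 1, 0]])

m = a.max() + 1                                  # assumes values lie in 0..m-1
offset = a + m * np.arange(a.shape[0])[:, None]  # shift each row into its own bin range
table = np.bincount(offset.ravel(), minlength=a.shape[0] * m).reshape(a.shape[0], m)
# table -> [[0 1 3 1]
#           [1 3 1 0]
#           [4 1 0 0]]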

I have worked around this with group-by aggregation. The function constructs a key1 that distinguishes the rows, uses the data itself as key2, and sums a two-dimensional array of all ones. It produces the output I want, but I consider it only a stopgap; I am looking for the proper way.

import numpy as np
import numpy_indexed as npi

def unique2d(x):
    x = x.astype(int)
    mx = x.max() + 1  # width of the result; assumes values in 0..mx-1

    # key1: the row index of every element
    ltbe = np.tile(np.arange(x.shape[0])[:, None], (1, x.shape[1]))
    # values to aggregate: all ones, so that group sums become counts
    vtbe = np.ones(x.shape, dtype=int)

    # group by (row index, value) and sum the ones
    groups = npi.group_by((ltbe.ravel(), x.ravel()))
    unique, counts = groups.sum(vtbe.ravel())

    # scatter the per-group counts into a dense (rows, mx) table
    ctbe = np.zeros(x.shape[0] * mx, dtype=int)
    ctbe[unique[0] * mx + unique[1]] = counts
    return ctbe.reshape(x.shape[0], mx)

unique2d(a)

>array([[0, 1, 3, 1],
        [1, 3, 1, 0],
        [4, 1, 0, 0]])

I hope someone has good suggestions or algorithms. Thanks.

weidong
  • Possible duplicate of [Can numpy bincount work with 2D arrays?](https://stackoverflow.com/questions/19201972/can-numpy-bincount-work-with-2d-arrays) – jdehesa Feb 23 '18 at 11:00
  • What you would need is [`np.bincount`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html) with an axis argument, but that is not implemented as of now (see issues [#8495](https://github.com/numpy/numpy/issues/8495) and [#9397](https://github.com/numpy/numpy/issues/9397)). As suggested in the linked question, you can use it with [`apply_along_axis`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.apply_along_axis.html) for now. – jdehesa Feb 23 '18 at 11:06
  • Unless there is truly no alternative, I will reject any approach that obviously loops – weidong Feb 23 '18 at 11:42
  • apply_along_axis is just another syntax for looping – Eelco Hoogendoorn Feb 23 '18 at 12:05
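For reference, the apply_along_axis workaround suggested by jdehesa above would look like the sketch below; as Eelco notes, it is a row-wise Python loop in disguise:

import numpy as np

a = np.array([[1, 2, 2, 2, 3], [0, 1, 1, 1, 2], [0, 0, 0, 1, 0]])

# one np.bincount call per row; minlength pads every row to the same width
counts = np.apply_along_axis(np.bincount, 1, a, minlength=a.max() + 1)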

1 Answer


The fewest lines of code I can come up with are as follows:

import numpy as np
import numpy_indexed as npi

a = np.array([[1,2,2,2,3],
              [0,1,1,1,2],
              [0,0,0,1,0]])

row_idx = np.indices(a.shape, dtype=np.int32)[0]  # row label of every element
axes, table = npi.Table(row_idx.flatten(), a.flatten()).count()  # counts per (row, value) pair

I haven't profiled this, but it does not contain any hidden un-vectorized for-loops, and I doubt you could do it much faster in numpy by any means. Nor do I expect it to perform a whole lot faster than your current solution, though. Using the smallest possible int types may help.

Note that this function does not assume that the elements of `a` form a contiguous set; the axis labels are returned in the `axes` tuple, which may or may not be the behavior you are looking for. Modifying the code in the `Table` class to conform to your current layout shouldn't be hard, though.

If speed is your foremost concern, your problem would probably map really well to numba.
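As a rough illustration of that suggestion (my sketch, not code from the answer; `unique2d_numba` is a hypothetical name, and it assumes integer values in 0..m-1):

import numpy as np
from numba import njit

a = np.array([[1, 2, 2, 2, 3], [0, 1, 1, 1, 2], [0, 0, 0, 1, 0]])

@njit
def unique2d_numba(x, m):  # hypothetical helper, not part of the answer
    # explicit loops are fine here: numba compiles them to machine code
    out = np.zeros((x.shape[0], m), dtype=np.int64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, x[i, j]] += 1
    return out

unique2d_numba(a, a.max() + 1)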

Eelco Hoogendoorn
  • The conclusion is surprising: the result is completely correct, but the group_by approach is almost twice as fast as Table (on a 1000 × 1000 array). I am not sure whether to adopt the cleaner, clearer way; I am also not happy about having to generate the all-ones array for group_by. But still, thank you for your reply – weidong Feb 23 '18 at 13:47
  • Hmm, not sure what is going on there. Table uses np.add.at, which I have often found to be strangely slow compared to the np.add.reduce that groupby.sum uses; that could be it. But in any case, for this kind of problem I think numba is easily 10 times faster. – Eelco Hoogendoorn Feb 23 '18 at 14:08
  • Thank you for your advice, I will try numba – weidong Feb 23 '18 at 14:48
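For reference, the np.add.at scatter-add pattern mentioned in the comments above can also build the table directly (a sketch under the same 0..m-1 integer-value assumption):

import numpy as np

a = np.array([[1, 2, 2, 2, 3], [0, 1, 1, 1, 2], [0, 0, 0, 1, 0]])

out = np.zeros((a.shape[0], a.max() + 1), dtype=int)
# unbuffered scatter-add: each repeated (row, value) pair contributes 1
np.add.at(out, (np.arange(a.shape[0])[:, None], a), 1)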