
I wrote code to find the maximum occurrence count in a numpy array. The numba version performs well, but it has limitations, and I wonder whether it can be improved to handle the general case.

numba implementation

import numba as nb
import numpy as np
import collections

@nb.njit("int64(int64[:])")
def max_count_unique_num(x):
    """
    Counts the maximum occurrence among the integers in x.

    Args:
        x (numpy array): Integer array.

    Returns:
        Int

    """
    # get maximum value
    m = x[0]
    for v in x:
        if v > m:
            m = v

    # if the max is 0, every value is 0 (input assumed non-negative)
    if m == 0:
        return x.size

    # count each unique value
    num = np.zeros(m + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    # maximum count
    m = 0
    for k in num:
        if k > m:
            m = k
    return m

For comparison, I also implemented versions based on numpy's unique and collections.Counter:

def np_unique(x):
    """ Counts maximum occurrence using numpy's unique. """
    ux, uc = np.unique(x, return_counts=True)
    return uc.max()


def counter(x):
    """ Counts maximum occurrence using collections.Counter. """
    counts = collections.Counter(x)
    return max(counts.values())

timeit

Edit: Added np.bincount for additional comparison, as suggested by @MechanicPig.

In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The limitations of the numba implementation are quite obvious: it is efficient only when all values in x are small non-negative integers, its performance drops significantly for very large integers, and it is not applicable to floats or negative values.
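(A sketch of a partial fix for negative values, not from my original code — the function name max_count_unique_num_offset is made up: shifting by the minimum lets the same counting approach accept negative integers, although memory still scales with the value range and floats remain unsupported.)

@nb.njit("int64(int64[:])")
def max_count_unique_num_offset(x):
    # shift by the minimum so negative integers map to valid indices
    lo = x.min()
    num = np.zeros(x.max() - lo + 1, dtype=np.int64)
    for v in x:
        num[v - lo] += 1
    return num.max()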

Any way I can generalize the implementation and keep the speed?


Update
After checking the source code of np.unique, an implementation for the general case can be:

@nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2(x):
    x.sort()  # note: sorts the input array in place
    n = 0
    k = 0
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
            k = 1
            x0 = v
    # account for the final run of equal values
    if k > n:
        n = k
    return n

timeit

In [154]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [155]: %timeit max_count_unique_num(x)
519 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [156]: %timeit np_unique(x)
1.3 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [157]: %timeit max_count_unique_num_2(x)
240 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [158]: x = np.random.randint(0, 200000, size=300000).astype(np.int64)
In [159]: %timeit max_count_unique_num(x)
1.01 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [160]: %timeit np_unique(x)
18.1 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit max_count_unique_num_2(x)
3.58 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So:

  1. If x contains large integers and its size is not large, max_count_unique_num_2 beats max_count_unique_num.
  2. Both max_count_unique_num and max_count_unique_num_2 are significantly faster than np.unique.
  3. A small modification to max_count_unique_num_2 can also return the item that has the maximum occurrence, or all items sharing that maximum (a sketch follows this list).
  4. max_count_unique_num_2 can be accelerated further by removing x.sort() if x is already sorted.
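For illustration, a minimal sketch of that modification (my own untested variant; the function name is made up, and the eager signature is dropped in favor of lazy compilation): it additionally tracks which value starts the longest run. Returning all tied items would need one more pass.

@nb.njit
def max_count_unique_num_2_item(x):
    x.sort()  # in place, as in max_count_unique_num_2
    n = 0
    k = 0
    best = x[0]
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
                best = x0
            k = 1
            x0 = v
    # account for the final run of equal values
    if k > n:
        n = k
        best = x0
    return best, n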
Elkan
  • Basically, you just want the count of the item with the maximum occurrence using numpy? – Zero May 11 '22 at 02:30
  • @Zero Yes, you could say that. But it would be great if I could also get the item; if not, that's fine too. I have edited the question to make the title and code consistent. – Elkan May 11 '22 at 02:35
  • I think your implementation is almost the same as `numpy.bincount`. – Mechanic Pig May 11 '22 at 02:54
  • @MechanicPig A great point. I added it for comparison too, but `numba` is still better. – Elkan May 11 '22 at 03:00
  • @Elkan It seems like I have a lot more to learn, but what exactly do you want the output to be? – Zero May 11 '22 at 03:02

3 Answers


What if you shorten your code:

@nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1

    return num.max()

or parallelized:

@nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in nb.prange(x.size):
        num[x[k]] += 1

    return num.max()

Parallelizing will win for larger data sizes. Note that the parallel version can return different results across runs, because the unsynchronized increments to num race between threads; this needs to be cured if possible.
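One way to cure the race, as a sketch (the function name is mine; it assumes nb.get_num_threads, which recent Numba releases provide): give each chunk a private histogram so no two threads write to the same row, then merge at the end.

@nb.njit("int64(int64[:])", parallel=True)
def shortened_parallel_safe(x):
    n_chunks = nb.get_num_threads()
    m = x.max()
    # one private histogram per chunk: no two threads share a row
    num = np.zeros((n_chunks, m + 1), dtype=np.int64)
    chunk = (x.size + n_chunks - 1) // n_chunks
    for c in nb.prange(n_chunks):
        start = c * chunk
        end = min(start + chunk, x.size)
        for i in range(start, end):
            num[c, x[i]] += 1
    # merge the per-chunk histograms, then take the maximum count
    total = np.zeros(m + 1, dtype=np.int64)
    for c in range(n_chunks):
        for v in range(m + 1):
            total[v] += num[c, v]
    return total.max()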

For handling the floats (or negative values) using Numba:

@nb.njit("int8(float64[:])", fastmath=True)
def shortened_float(x):
    num = np.zeros(x.size, dtype=np.int8)
    for k in x:
        for j in range(x.shape[0]):
            if k == x[j]:
                num[j] += 1

    return num.max()

IMO, np.unique(x, return_counts=True)[1].max() is the best choice: it handles both integers and floats in a very fast implementation. Numba can be faster for integers (it depends on the data size; performance weakens as the data grows, AFAIK because of looping instead of array operations), but for floats the code would need to be optimized for performance, if that is possible. I don't think Numba can beat NumPy's unique, particularly when we are faced with large data.

Note: np.bincount can handle only non-negative integers.
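(As a sketch, negative integers can still be routed through np.bincount by shifting them first; the example values here are made up:)

import numpy as np

x = np.array([-3, -3, 0, 7, -3])       # example with negative integers
print(np.bincount(x - x.min()).max())  # 3: shifting makes all bins valid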

Ali_Sh
  • Shortening code by `ndarray.max` seems to have no effect, and the acceleration of `numba` surprised me. – Mechanic Pig May 11 '22 at 03:54
  • `But I don't think that Numba can beat NumPy unique, particularly when we faced to large data.`. I don't really think so. See my update. – Elkan May 12 '22 at 02:51
  • @Elkan, Your benchmarks are just for *integers*. I think `max_count_unique_num(x)` is the fastest for *integers*, but it cannot handle *floats*. Your second solution, which as you said is derived from `np.unique`, will be a little faster (very close) than `np.unique` on smaller volumes of int and float data, but not on huge data (e.g. test it on Colab with `np.random.randint(0, 2000000, size=7000000).astype(np.float64)`, which has no decimals; perhaps decimals make it worse). You must also consider that you have precompiled the code for specific data types with numba (a disadvantage). – Ali_Sh May 12 '22 at 03:18
  • @Elkan, IMO, the fastest method will depend on your test data volume (as I mentioned in the answer) and its data types (if you want to use numba), and so on. So I think `np.unique` will be ***the most general method***, which can handle both integers and floats and has other capabilities, e.g. `return_index` and so on. I suggested it based on my experience and because [one of the most professional performance experts said on a previous question of mine: *The `np.unique` call is the most expensive, but it is not easy to make it faster.*](https://stackoverflow.com/q/71104627/13394817) – Ali_Sh May 12 '22 at 03:34
  • I read the answer; he mentioned that `np.unique` is not easy to make faster, referring to this line: `ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)`. **This depends.** I just want the maximum occurrence (possibly with the item too) in a 1d array of `float` or `int`, so many parts of the `np.unique` implementation can be simplified away. Also, in his code the array has already been sorted by `np.sort(ends_ind_org)`. In my case, or in my question, the `numba` implementation can efficiently get what I want. – Elkan May 12 '22 at 03:42
  • Also, following your example, I tested on a large `float` array: `x = np.random.randn(3000000).astype(np.float64)`. It is still significantly faster than `np.unique`: `43.1 ms ± 445 µs per loop` vs `209 ms ± 1.39 ms per loop`. – Elkan May 12 '22 at 03:48
  • @Elkan, each capability of `np.unique` will consume additional time, I know, but the aforementioned quotation is general, not just related to my case (I have seen such ideas on other questions). For the benchmark I mentioned, you can [see the results on my Colab](https://colab.research.google.com/drive/1zzEm6gMc-3u6aDil2YsFLcD0_ZwqCyTt?usp=sharing). Please take a look. Believe me, I would be happy if you could find a faster general alternative to `np.unique` :) (it would be helpful). If it is very important to you, putting a bounty on this question may encourage professional experts (rather than me) to help. – Ali_Sh May 12 '22 at 04:27
  • @Elkan, Are you searching for a general solution or a specific one for your case? I thought you were seeking a general solution, given that you wrote *can be improved to a general case*. As I mentioned, which solution is better depends completely on your case (you are saying Numba is better). `max_count_unique_num_2(x)` has good performance (+1). I hope others can propose a better-optimized solution. – Ali_Sh May 12 '22 at 04:45
  • @Ali_Sh Thanks. I think if it were too general, covering different number types, more dimensions, etc., it would be too complicated. It is currently good enough. +1 for the meaningful discussion. – Elkan May 12 '22 at 04:49
  • @Elkan, The last word: however, I believe based on my experience that such general code could be written by experts if they were encouraged by bounties (add the tag *optimization*), if you are not satisfied with your numba code. – Ali_Sh May 12 '22 at 05:03

You can do that without using numpy too.

arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]
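As a sketch of a faster pure-Python route (collections.Counter counts in O(n), unlike the repeated list.count calls above, and most_common(1) yields the (item, count) pair directly):

import collections

arr = [1, 1, 2, 2, 3, 3, 4, 5, 6, 1, 3, 5, 7, 1]
item, count = collections.Counter(arr).most_common(1)[0]
print(item, count)  # 1 4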

If you want to use numpy then try this,

arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]
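(If the count itself is what you need rather than the item, counts.max() returns it directly.)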

Both do the exact same job. To check which method is more efficient just do this,

import time

time_i = time.time()
<arr declaration>  # declare the array once: re-creating it each iteration would inflate the total time and bias against the numpy method
for i in range(10**5):
    <method you want>
time_f = time.time()
print(time_f - time_i)

When I ran this I got 0.39 seconds for the first method and 2.69 for the second one. So it's pretty safe to say that the first method is more efficient.
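A sketch of the same comparison with the standard timeit module, which handles the looping for you (the array is the example from above):

import timeit
import numpy as np

arr = np.array([1, 1, 2, 2, 3, 3, 4, 5, 6, 1, 3, 5, 7, 1])

print(timeit.timeit(lambda: list(map(list(arr).count, set(arr))), number=10**5))
print(timeit.timeit(lambda: np.unique(arr, return_counts=True)[1].max(), number=10**5))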

Zero
  • I don't think you understood my question: I don't want the item with the max occurrence, I want the **max occurrence** itself. I also implemented `unique` and `collections.Counter`; they can't beat `numba`. – Elkan May 11 '22 at 02:42
  • @Elkan So suppose `1` occurred `4` times, so you want 4 instead of 1? – Zero May 11 '22 at 02:43
  • I think your judgment is not comprehensive enough. If you implement it in pure Python, `Counter` is better than your first method, because the time complexity of `Counter` is `O(n)` while yours is `O(n^2)`. In addition, although the time complexity of `unique` is `O(n log n)`, it is almost entirely implemented in C, so it is even faster than `Counter` for large arrays. – Mechanic Pig May 11 '22 at 02:53

What I want to say is that your implementation is almost the same as numpy.bincount. If you want to make it universal, you can consider encoding the original data:

def encode(ar):
    # Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
    flatten = ar.ravel()
    perm = flatten.argsort()
    sort = flatten[perm]

    # True wherever a new unique value starts in the sorted order;
    # cumsum then assigns each value its dense rank, scattered back via perm
    mask = np.concatenate(([False], sort[1:] != sort[:-1]))
    encoded = np.empty(sort.shape, np.int64)
    encoded[perm] = mask.cumsum()
    encoded.shape = ar.shape

    return encoded

def count_max(ar):
    return max_count_unique_num(encode(ar))
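For instance (a usage sketch with made-up values), the encoding makes the integer-only function work on floats and negatives:

x = np.array([1.5, -2.0, 1.5, 3.0, 1.5])
print(count_max(x))  # 3: the value 1.5 occurs three times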
Mechanic Pig
  • I implemented your version; it can't beat `numba` or `np.bincount`, and it's even worse than `np.unique`: `1.49 ms ± 3.29 µs per loop` for the first `x` and `1.69 ms ± 6.34 µs per loop` for the second `x`. – Elkan May 11 '22 at 03:18
  • @Elkan This is inevitable, because it only generalizes your program. You can pass the encoded array to the function you implemented. – Mechanic Pig May 11 '22 at 03:30