It's better to stick to regular NumPy arrays rather than chararrays. As the NumPy documentation notes:
The chararray class exists for backwards compatibility with
Numarray, it is not recommended for new development. Starting from
numpy 1.4, if one needs arrays of strings, it is recommended to use
arrays of dtype object_, string_ or unicode_, and use the free
functions in the numpy.char module for fast vectorized string
operations.
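As a minimal illustration of that recommendation (the array contents here are just made-up examples), a regular string array works directly with the numpy.char free functions:

import numpy as np

names = np.array(['apple', 'Avocado', 'banana'])   # plain string array, not a chararray
print(np.char.upper(names))                        # ['APPLE' 'AVOCADO' 'BANANA']
print(np.char.startswith(names, 'A'))              # [False  True False]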
Working with regular arrays, here are two approaches.
Approach #1
We could use np.count_nonzero to count the True values after comparing against the search element 'A':
np.count_nonzero(rr=='A')
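As a self-contained sketch (the sample array mirrors the data from the question; any array of single characters behaves the same way):

import numpy as np

rr = np.array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'])  # dtype '<U1'
# rr == 'A' gives a boolean array; count_nonzero counts its True entries
print(np.count_nonzero(rr == 'A'))  # 5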
Approach #2
Since the chararray holds single-character elements only, we can optimize a lot better by viewing it with a uint8 dtype and then comparing and counting. The counting is much faster because we are working with numeric data. The implementation would be:
np.count_nonzero(rr.view(np.uint8)==ord('A'))
On Python 2.x, it would be:
np.count_nonzero(np.array(rr.view(np.uint8))==ord('A'))
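A runnable sketch of the same idea, assuming plain ASCII characters; an 'S1' (bytes) array is used here so that each element maps to exactly one uint8:

import numpy as np

rr = np.array([b'B', b'B', b'B', b'A', b'B', b'A', b'A', b'A', b'B', b'A'], dtype='S1')
# Viewing the buffer as uint8 turns the string comparison into a purely numeric one:
# each 'S1' element is exactly one byte.
print(np.count_nonzero(rr.view(np.uint8) == ord('A')))  # 5

The same view also works on a '<U1' array like the one in the timings below; there each element spans four bytes, but for ASCII letters only one of them is non-zero, so counting matches of ord('A') still gives the right answer.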
Timings
Timings on the original sample data and on it scaled up 10,000x:
# Original sample data
In [10]: rr
Out[10]: array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'], dtype='<U1')
# @Nils Werner's soln
In [14]: %timeit np.sum(rr == 'A')
100000 loops, best of 3: 3.86 µs per loop
# Approach #1 from this post
In [13]: %timeit np.count_nonzero(rr=='A')
1000000 loops, best of 3: 1.04 µs per loop
# Approach #2 from this post
In [40]: %timeit np.count_nonzero(rr.view(np.uint8)==ord('A'))
1000000 loops, best of 3: 1.86 µs per loop
# Original sample data scaled by 10,000x
In [16]: rr = np.repeat(rr,10000)
# @Nils Werner's soln
In [18]: %timeit np.sum(rr == 'A')
1000 loops, best of 3: 734 µs per loop
# Approach #1 from this post
In [17]: %timeit np.count_nonzero(rr=='A')
1000 loops, best of 3: 659 µs per loop
# Approach #2 from this post
In [24]: %timeit np.count_nonzero(rr.view(np.uint8)==ord('A'))
10000 loops, best of 3: 40.2 µs per loop
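To reproduce comparable numbers outside IPython, here is a minimal sketch using the standard-library timeit module (absolute timings will of course vary by machine):

import timeit

setup = (
    "import numpy as np; "
    "rr = np.repeat(np.array(['B','B','B','A','B','A','A','A','B','A']), 10000)"
)

for label, stmt in [
    ("np.sum(rr == 'A')", "np.sum(rr == 'A')"),
    ("np.count_nonzero(rr == 'A')", "np.count_nonzero(rr == 'A')"),
    ("uint8 view + count_nonzero", "np.count_nonzero(rr.view(np.uint8) == ord('A'))"),
]:
    per_loop = timeit.timeit(stmt, setup=setup, number=100) / 100
    print(f"{label:35s} {per_loop * 1e6:8.1f} µs per loop")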