
I have a data file (temp.dat) that consists of 3 columns and ~20k rows. It looks like this:

0 1 100.00
0 2 100.00
0 3 100.00
...
1 10 100.00
1 11 100.00
1 12 100.00
1 13 100.00
1 14 100.00
1 15 100.00
1 16 100.00
1 17 100.00
...
0 10 100.00
0 11 100.00
0 12 100.00
...

I would like to count the number of rows that satisfy the criteria in the code below. I tried both map() and a list comprehension, but they both seem incredibly slow; the list comprehension is only about a minute faster.

import numpy as np

data = np.genfromtxt('temp.dat')
base1, base2, pct = data[:, 0], data[:, 1], data[:, 2]
expected_count = 10000

BASE_NAME = []
for x in range(0, 36):
    count1 = sum(map(lambda base1: base1 == x, base1))
    count2 = sum(map(lambda base2: base2 == x, base2))
    total_count = count1 + count2
    if total_count == expected_count:
        base_num = x
        BASE_NAME.append(base_num)

total_base_name = len(BASE_NAME)
print(total_base_name)

For list comprehension, the syntax becomes:

count1 = sum([base1 == x for base1 in base1])
count2 = sum([base2 == x for base2 in base2])
  • Does this answer your question? [How to count the occurrence of certain item in an ndarray in Python?](https://stackoverflow.com/questions/28663856/how-to-count-the-occurrence-of-certain-item-in-an-ndarray-in-python) – norok2 Apr 01 '20 at 12:05
  • Partially; I ended up using NumPy's count_nonzero. – hamster Apr 02 '20 at 14:19

1 Answer


(EDITED: I initially overlooked that you were using NumPy arrays.)

To replace:

sum(map(lambda base1: base1 == x, base1))

or:

count1 = sum([base1 == x for base1 in base1])

the best approach depends on whether your input is a list or a NumPy array.

  • if you have a list, you could use the list.count() method:
base1.count(x)
  • if you have a NumPy array, as seems to be the case here, you could use np.count_nonzero() (a sketch applying it to your loop follows this list):
import numpy as np

np.count_nonzero(base1 == x)
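
For instance, applied to the loop from the question (a sketch using the same variable names, assuming base1 and base2 are loaded as above), this could look like:

BASE_NAME = []
for x in range(0, 36):
    # Count matches in each column with a vectorized comparison
    # instead of a Python-level loop over every row.
    total_count = np.count_nonzero(base1 == x) + np.count_nonzero(base2 == x)
    if total_count == expected_count:
        BASE_NAME.append(x)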

However, np.count_nonzero(base1 == x) creates a potentially large temporary boolean array. This can be avoided by writing your own counting function and accelerating it with Cython (not shown) or, even better, with Numba, as shown below:

import numba as nb


@nb.jit
def nb_count_equal(arr, value):
    # JIT-compiled explicit loop: counts matches without
    # allocating a temporary boolean array.
    result = 0
    for x in arr:
        if x == value:
            result += 1
    return result

which would also be faster than np.count_nonzero() in this case.

Testing some of these approaches on toy data shows that they give the same result:

np.random.seed(0)  # to ensure reproducible results

arr = np.random.randint(0, 20, 1000)
y = 10

print(sum(map(lambda x: x == y, arr)))
# 41
print(sum([x == y for x in arr]))
# 41
print(np.count_nonzero(arr == y))
# 41
print(nb_count_equal(arr, y))
# 41

with the following timing:

arr = np.random.randint(0, 20, 1000000)
y = 10

%timeit sum(map(lambda x: x == y, arr))
# 1 loop, best of 3: 2.54 s per loop
%timeit sum([x == y for x in arr])
# 1 loop, best of 3: 2.43 s per loop
%timeit np.count_nonzero(arr == y)
# 1000 loops, best of 3: 574 µs per loop
%timeit nb_count_equal(arr, y)
# 1000 loops, best of 3: 224 µs per loop

Note: the earlier suggestion of dropping the square brackets, i.e. passing sum() a generator expression instead of a list comprehension, turns out to be slightly slower because of the way sum() is implemented, but it does have the advantage of not allocating an unnecessary temporary list.
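
For reference, the generator variant is the same comprehension without the brackets (using arr and y from the snippets above):

sum(x == y for x in arr)  # no intermediate list is built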


Finally, if you are doing this counting for several different values, it may be more efficient to count them all in a single pass with np.unique().
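
For example, np.unique() can return the count of every distinct value at once (a sketch, reusing arr from above):

# One pass over the data yields the count of every distinct value.
values, counts = np.unique(arr, return_counts=True)
count_per_value = dict(zip(values, counts))  # look up e.g. count_per_value[10]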

  • @hamster Sorry, I had overlooked that you were using NumPy arrays; check out the updated answer. – norok2 Apr 01 '20 at 13:20
  • Numba isn't an option, as I have to use a remote cluster computer to do the processing and it isn't installed there. I tested with np.unique and count_nonzero, and both run a lot faster when looping 36 times over 20k rows per input file. I'll ask if I have any more questions. – hamster Apr 02 '20 at 04:32