8

I'm trying to count the unique values in a numpy array.

import numpy as np
from collections import defaultdict
import scipy.stats
import time

x = np.tile([1,2,3,4,5,6,7,8,9,10],20000)
for i in [44,22,300,403,777,1009,800]:
    x[i] = 11

def getCounts(x):
    counts = defaultdict(int)
    for item in x:
        counts[item] += 1
    return counts

flist = [getCounts, scipy.stats.itemfreq]

for f in flist:
    print f
    t1 = time.time()
    y = f(x)
    t2 = time.time()
    print y
    print '%.5f sec' % (t2-t1)

I couldn't find a builtin function at first to do this, so I wrote getCounts(); then I found scipy.stats.itemfreq so thought I would use that instead. But it's slow! Here's what I get on my PC. Why is it so slow compared to such a simple handwritten function?

<function getCounts at 0x0000000013C78438>
defaultdict(<type 'int'>, {1: 19998, 2: 20000, 3: 19999, 4: 19999, 5: 19999, 6: 20000, 7: 20000, 8: 19999, 9: 20000, 10: 19999, 11: 7})
0.04700 sec
<function itemfreq at 0x0000000013C5D208>
[[  1.00000000e+00   1.99980000e+04]
 [  2.00000000e+00   2.00000000e+04]
 [  3.00000000e+00   1.99990000e+04]
 [  4.00000000e+00   1.99990000e+04]
 [  5.00000000e+00   1.99990000e+04]
 [  6.00000000e+00   2.00000000e+04]
 [  7.00000000e+00   2.00000000e+04]
 [  8.00000000e+00   1.99990000e+04]
 [  9.00000000e+00   2.00000000e+04]
 [  1.00000000e+01   1.99990000e+04]
 [  1.10000000e+01   7.00000000e+00]]
2.04100 sec
Jason S
  • 184,598
  • 164
  • 608
  • 970
  • 3
    There is a issue ticket on this function [here](https://github.com/scipy/scipy/issues/599). It looks like the maintainers had similar concerns about how it was written. If you want to know why it was so slow, maybe check out [this commit](https://github.com/scipy/scipy/commit/7e04d6630f229693cca3522b62aa16226f174053). Depending on whether you're using a scipy version before or after this commit, you should be able to see how it is implemented. – jedwards Sep 25 '14 at 20:35

4 Answers4

24

If you can use numpy 1.9, you can use numpy.unique with the argument return_counts=True. I.e.

unique_items, counts = np.unique(x, return_counts=True)

In fact, itemfreq was updated to use np.unique, but scipy currently supports numpy versions back to 1.5, so it doesn't use the return_counts argument.

Here's the complete implementation of itemfreq in scipy 0.14:

def itemfreq(a):
    items, inv = np.unique(a, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
  • I never realized the speedup the numpy function has. Makes you wonder why it isn't used in the scipy function. – Roger Fan Sep 25 '14 at 20:46
  • Actually I can just use `np.bincount` in my application since I have a large array of small integers. – Jason S Sep 25 '14 at 20:56
3

First, time.time is the wrong function to use when timing, as it measures wall-clock time, not cpu time (see this question). Ideally you would use the timeit module, but time.clock is also better.

Also, it seems that you might be using an outdated version of scipy. I'm using Python 3.4 and scipy 0.14.0 and these are my timings:

x = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 20000)
for i in [44, 22, 300, 403, 777, 1009, 800]:
    x[i] = 11

%timeit getCounts(x)
# 10 loops, best of 3: 55.6 ms per loop

%timeit scipy.stats.itemfreq(x)
# 10 loops, best of 3: 20.8 ms per loop

%timeit collections.Counter(x)
# 10 loops, best of 3: 39.9 ms per loop

%timeit np.unique(x, return_counts=True)
# 100 loops, best of 3: 4.13 ms per loop
Community
  • 1
  • 1
Roger Fan
  • 4,945
  • 31
  • 38
2

For the lazy:

import pandas as pd
pd.Series( my_list_or_array ).nunique()
scottlittle
  • 18,866
  • 8
  • 51
  • 70
0

Thanks for the responses. I can't use numpy 1.9 yet or scipy 0.14 because of some module conflicts in my application but looks like the new scipy.stats.itemfreq is much faster:

import numpy as np
from collections import defaultdict, Counter
import scipy.stats
import time
import timeit

x = np.tile([1,2,3,4,5,6,7,8,9,10],20000)
for i in [44,22,300,403,777,1009,800]:
    x[i] = 11

def getCounts(x):
    counts = defaultdict(int)
    for item in x:
        counts[item] += 1
    return counts

def itemfreq_scipy14(x):
    '''this is how itemfreq works in 0.14:
    https://github.com/scipy/scipy/commit/7e04d6630f229693cca3522b62aa16226f174053
    '''
    items, inv = np.unique(x, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T

flist = [getCounts, scipy.stats.itemfreq, np.bincount, itemfreq_scipy14, Counter]


for f in flist:
    print f
    print timeit.timeit(lambda: f(x),number=3)

which yields on my PC:

<function getCounts at 0x0000000013F8EB38>
0.148138969181
<function itemfreq at 0x0000000013C5D208>
6.15385023664
<built-in function bincount>
0.00313706656675
<function itemfreq_scipy14 at 0x0000000013F8EDD8>
0.0757223407165
<class 'collections.Counter'>
0.255281199559
Jason S
  • 184,598
  • 164
  • 608
  • 970