1

I'm looking for a more efficient implementation for a generic "dictionary counter". Currently this naive function produces a faster result compared to the collections.Counter implementation

def uniqueCounter(x):
    dx = defaultdict(int)
    for i in x:
        dx[i] += 1
    return dx

EDIT: Some characteristic sample input:

c1= zip(np.random.randint(0,2,200000),np.random.randint(0,2,200000))
c2= np.random.randint(0,2,200000)

c1: 
uniqueCounter timing: 
10 loops, best of 3: 61.1 ms per loop
collections.Counter timing:
10 loops, best of 3: 113 ms per loop 

c2:
uniqueCounter timing: 10 loops, best of 3: 57 ms per loop
collections.Counter timing: 10 loops, best of 3: 120 ms per loop
Fredrik Pihl
  • 44,604
  • 7
  • 83
  • 130
Gidon
  • 537
  • 2
  • 6
  • 18
  • 3
    Please define "fast enough". How did you measure? What performance do you need? What does your data look like? What are you trying to do? ... – Fredrik Pihl Feb 23 '14 at 21:54
  • 1
    This surely belongs on http://codereview.stackexchange.com – anon582847382 Feb 23 '14 at 21:56
  • What kind of objects do you have to count? Numbers? Strings? Any kind of object? How many elements do the input have at most and on average? – Bakuriu Feb 23 '14 at 21:59
  • And are you sure you are CPU bound? Sounds like an I/O problem to me. – tripleee Feb 23 '14 at 21:59
  • You can set a global variabel, for instance A and then write A+=1 whenever you want to increase it, or A = some value. Here's an SO post on how to set a global variable http://stackoverflow.com/questions/423379/using-global-variables-in-a-function-other-than-the-one-that-created-them – user1749431 Feb 23 '14 at 22:05
  • 1
    @user1749431 What? What will the OP do with a global variable? Also accessing globals is *much* slower than accessing local variables, so that's going to slow things down. – Bakuriu Feb 23 '14 at 22:10
  • No doubt about the difference between using global and local variables but using them as class parameters is easier, and sometimes faster than with dx[i] +=1 depended on the purpose, http://stackoverflow.com/questions/12590058/python-performance-with-global-variables-vs-local – user1749431 Feb 23 '14 at 22:27

1 Answers1

1

Try using numpy.bincount

In [19]: Counter(c2)
Out[19]: Counter({1: 100226, 0: 99774})

In [20]: uniqueCounter(c2)
Out[20]: defaultdict(<type 'int'>, {0: 99774, 1: 100226})

In [21]: np.bincount(c2)
Out[21]: array([ 99774, 100226])

Some timings:

In [16]: %timeit np.bincount(c2)
1000 loops, best of 3: 2 ms per loop

In [17]: %timeit uniqueCounter(c2)
1 loops, best of 3: 161 ms per loop

In [18]: %timeit Counter(c2)
1 loops, best of 3: 362 ms per loop
Fredrik Pihl
  • 44,604
  • 7
  • 83
  • 130
  • `bincount` requires the input to be an array-like object containing non-negative integer data, and if `max(c2)` is much greater than `len(c2)`, it can be highly inefficient. It's good for when you can use it, but it's far from a general-purpose solution. – user2357112 Feb 23 '14 at 23:00
  • bincount is working great for the c2 example, but it is not suitable for a list of tuples such as c1 – Gidon Feb 24 '14 at 06:22
  • perhaps you can tweek your data-structures so that you can use bincount() – Fredrik Pihl Feb 24 '14 at 15:47