12

I want to write some tests to analyse the efficiency of different operations in python, namely a comparison of dictionary comprehensions and dict generators.

To test this out, I thought I would try a simple example: count the number of words in a list using dictionaries.

Now I know that you can do this using collections.Counter (as per an answer here: How can I count the occurrences of a list item in Python?), but my objective was to test performance an memory.

One "long-hand" way is to do it in a basic loop.

from pprint import pprint

# Read in some text to create example data
with open('text.txt') as f:
    words = f.read().split()

dict1 = {}
for w in words:
    if not dict1.get(w):
        dict1[w] = 1
    else:
        dict1[w] += 1
pprint(dict1)

The result:

{'a': 62,
 'aback': 1,
 'able': 1,
 'abolished': 2,
 'about': 6,
 'accept': 1,
 'accepted': 1,
 'accord': 1,
 'according': 1,
 'across': 1,
 ...

Then I got a bit stuck trying to do the same in a dictionary comprehension:

dict2  = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
            for w in words }

I got an error:

NameError: global name 'dict2' is not defined

I tried defining the dict up front:

dict2 = {}
dict2  = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
            for w in words }
pprint(dict2)

But of course the counts are all set to 1:

{'a': 1,
 'aback': 1,
 'able': 1,
 'abolished': 1,
 'about': 1,
 'accept': 1,
 'accepted': 1,
 'accord': 1,
 'according': 1,
 'across': 1,
 ...

I had a similar problem with dict comprehension:

dict3 = dict( (w, 1 if not dict2.get(w) else dict2.get(w) + 1)
                for w in words)

So my question is: how can I use a dictionary comprehension/generator most efficiently to count the number of occurrences in a list?

Update: @Rawing suggested an alternative approach {word:words.count(word) for word in set(words)} but that would circumvent the mechanism I am trying to test.

Community
  • 1
  • 1
Captain Whippet
  • 2,143
  • 4
  • 25
  • 34
  • `dict2` is empty if first place that's why you got that result. The reason is that you don't insert the results in `dict2` when checking `dict2.get(w)`. I don't know if you can solve this problem with dictionary comprehension since you have to store the counts. – badc0re Nov 04 '14 at 09:34
  • I think the way to do that is `{word:words.count(word) for word in set(words)}`, but I doubt it's efficient. – Aran-Fey Nov 04 '14 at 09:49
  • @badc0re Yep, I think you may be right. Perhaps I need to come up with a better test example. I'll see if anyone else has any ideas. Thanks for your help. – Captain Whippet Nov 04 '14 at 09:50
  • 1
    @Rawing Good point - but it would kind of defeat the purpose of what I'm trying to do. I'll update the question with a note about that, so thanks. – Captain Whippet Nov 04 '14 at 09:51

4 Answers4

13

You cannot do this efficiently(at least in terms of memory) using a dict-comprehension, because then you'll have to keep track of current count in another dictionary i.e more memory consumption. Here's how you can do it using a dict-comprehension(not recommended at all :-)):

>>> words = list('asdsadDASDFASCSAASAS')
>>> dct = {}
>>> {w: 1 if w not in dct and not dct.update({w: 1})
                  else dct[w] + 1
                  if not dct.update({w: dct[w] + 1}) else 1 for w in words}
>>> dct
{'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}

Another way will be to sort the words list first then group them using itertools.groupby and then count the length of each group. Here the dict-comprehension can be converted to a generator if you want, but yes this will require reading all words in memory first:

from itertools import groupby
words.sort()
dct = {k: sum(1 for _ in g) for k, g in groupby(words)}

Note that the fastest one of the lot is collections.defaultdict:

d = defaultdict(int)
for w in words: d[w] += 1 

Timing comparisons:

>>> from string import ascii_letters, digits
>>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
10 loops, best of 3: 131 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
10 loops, best of 3: 169 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 315 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = defaultdict(int)
... for w in words: d[w] += 1
... 
10 loops, best of 3: 57.1 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**4
d = {}
for w in words: d[w] = d.get(w, 0) + 1
... 
10 loops, best of 3: 108 ms per loop

#Increase input size 

>>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
1 loops, best of 3: 1.44 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
1 loops, best of 3: 1.7 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}

1 loops, best of 3: 3.19 s per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = defaultdict(int)
for w in words: d[w] += 1
... 
1 loops, best of 3: 571 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = {}
for w in words: d[w] = d.get(w, 0) + 1
... 
1 loops, best of 3: 1.1 s per loop
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • 1
    `collections.Counter` seems like for sure the most pythonic way, since this Q&A is no a dupe target, will you update your answer? – Chris_Rands Jan 07 '19 at 15:58
  • 3
    @Chris_Rands: And in fact, as of Python 3.2, `Counter` will win over the `defaultdict(int)` + loop approach. They added a C accelerator for counting input iterables, so where my machine roughly matches Ashwini's `defaultdict(int)` speed (I get 552 ms on Linux x64 Python 3.6.4, ipython 7.2.0, only trivially faster than Ashwini), the `Counter` test is now significantly faster (374 ms; roughly one fifth the speed pre-accelerator, and a solid third lower runtime than the next closest competitor, `defaultdict(int)`). – ShadowRanger Jan 07 '19 at 16:04
  • As of Python 3.6, this [SO answer](https://stackoverflow.com/a/58811298/307454) implies `defaultdict` is faster than `Counter`? – lifebalance Jan 02 '20 at 02:10
  • @lifebalance That answer should be passing the `range()` object directly to `Counter` instead of using a loop. – Ashwini Chaudhary Jan 02 '20 at 13:05
12

You can do it this way:

>>> words=['this','that','is','if','that','is','if','this','that']
>>> {i:words.count(i) for i in words}
{'this': 2, 'is': 2, 'if': 2, 'that': 3}
Irshad Bhat
  • 8,479
  • 1
  • 26
  • 36
1

It is a use case where comprehension is not adapted/efficient.

Comprehension is good when you can build the collection in one single operation. It is not really the case here, since :

  • either you take the words as they come and change values in the dict accordingly
  • or you have to first compute the key set (Rawing solution), but then you browse the list once for getting the key set, and once per key

IMHO, the most efficient way is the iterative one.

Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
1
#1

words = ['asdsadDASDFASCSAASAS']

word_dic = {}

{word_dic.update({key: (1 if key not in word_dic else word_dic[key] + 1)}) for str in words for key in str}

print(word_dic)

{'a': 2, 's': 2, 'd': 2, 'D': 2, 'A': 5, 'S': 5, 'F': 1, 'C': 1}



#2

year = [14, 14, 60, 12, 12, 75, 22, 22, 56, 31, 31, 31, 70, 70, 17, 49, 49, 45, 45, 68]

num_dic = {}

{num_dic.update({key: (1 if key not in num_dic else num_dic[key] + 1)}) for key in year}

print(num_dic)

{14: 2, 31: 3, 60: 1, 12: 2, 75: 1, 22: 2, 56: 1, 70: 2, 17: 1, 49: 2, 45: 2, 68: 1}
Suraj Rao
  • 29,388
  • 11
  • 94
  • 103
  • 1
    Welcome to Stack Overflow! While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Suraj Rao Mar 17 '22 at 15:42