0

In this answer, I am creating a dict of empty sets by iterating through a list. Then I iterate through the same list & filling those sets. An MRE:


# imports we need
import time
import numpy as np
np.random.seed(42)

What I am doing

An example list of letters. Note that at least one letter will appear more than once.

letters=[np.random.choice([letter for letter in string.ascii_lowercase]) for _ in range(1000)]

Result:

['w', 'n', 'k', 'o', 'm', 'r', ...

Creating a dict with letters as keys, empty sets as values:

letterdict={letter:set() for letter in letters}

Iterating through the letters list again, each entry in the list with the corresponding letter will be a set where the indices of that letter appears in the letters list:

for index, letter in enumerate(letters):
    letterdict[letter].add(index)

letterdict will look like:

{'w': {0, 12, 62, 67, 69, ...


How long it was

This process took:

start = time.time()
letterdict={letter:set() for letter in letters}
for index, letter in enumerate(letters):
    letterdict[letter].add(index)
end = time.time()
print(end-start)

0.000538... sec.


The question

Is there a way to make the creation of letterdict quicker? Afterall, I am iterating through letters twice.

My thoughts: If I could make it in one loop, when it encounters a letter for the first time, I could create a set, and put the index of the letter in it. When encountering the letter for the second time, it could not reset the set, just add the index. However, checking that a letter is already encountered or not seems tedious (ie slow).

In the MRE, assume, that we do not know what all the letters are, so replacing the first loop by {letter:set() for letter in string.ascii_lowercase} is not really useful.

zabop
  • 6,750
  • 3
  • 39
  • 84
  • 0.000538... sec isn't quite fast enough? – nagyl Jul 25 '20 at 21:17
  • It's just an example, I am using this with a real-world problem, where it takes much longer... – zabop Jul 25 '20 at 21:21
  • 1
    There is already a bit of an improvement if you do `letterdict = {letter:set() for letter in set(letters)}` (getting about 20-25% improvement by doing this when otherwise using your code but with 100000 in place of 1000). – alani Jul 25 '20 at 21:29
  • 1
    Mind you, you could probably get rid of that first bit entirely by using a `defaultdict(set)`. Going to give that a test.... and the answer is that it maybe it improves it a little bit more but not dramatically beyond the improvement already mentioned (that is with `from collections import defaultdict` and `letterdict = defaultdict(set)`) – alani Jul 25 '20 at 21:33
  • @alaniwi, Thanks! That indeed seems to make it faster, good idea. – zabop Jul 25 '20 at 21:39

2 Answers2

1

You probably want a collections.defaultdict.

This will create the empty set on lookup if the key is not already present:

from collections import defaultdict

letterdict = defaultdict(set)
for index, letter in enumerate(letters):
     letterdict[letter].add(index)

This way you don't have to initialize the dictionary with empty sets.


You could instead just use the .setdefault() method on a normal dict.

letterdict = {}
for index, letter in enumerate(letters):
     letterdict.setdefault(letter, set()).add(index)

This has the overhead of creating a new set object each time whether it's needed or not, but the builtin set() types seem to construct pretty quickly. It wasn't any slower on the small samples I used to time it. CPython might pool these somehow when they get deleted quickly, like it does for tuples.

gilch
  • 10,813
  • 1
  • 23
  • 28
  • Tested this, this is indeed faster than my method. Another improvement seem possible by using `enumerate(set(letters))`, as suggested by @alaniwi in the comments below OP... – zabop Jul 25 '20 at 21:39
  • @zabop wouldn't that change the answer though? You'd get different indexes than before. – gilch Jul 25 '20 at 21:40
  • @zabop For the record, I wasn't suggesting `enumerate(set(letters))`, but only using `set(letters)` where constructing the initial dictionary of empty sets (before I got the same idea as you about defaultdicts, which then made that bit redundant). – alani Jul 25 '20 at 21:44
1

Okay, so I think there is no significant gain in time with this method, but you can do the same job with one loop, by using a try except block.

start = time.time()
for index, letter in enumerate(letters):
    try:
        letterdict[letter].add(index)
    except:
        letterdict[letter] = {index}
end = time.time()
print(end-start)

Basically, we create a set while looping though the list, therefore no need for another loop.

My time, first is your method, second is mine:

0.0009996 #Your method
0.0009992 #My method

As I said, no significant gain, but according to several runs, it slightly faster.

nagyl
  • 1,644
  • 1
  • 7
  • 18