15

I have a dict with approximately 17,000 keys. I would like to select one key at a time--it doesn't matter which one, and I don't need it to happen in any particular order (random is fine). However, after I select a key, I will alter the dictionary, perhaps by adding or deleting a key, before selecting another one. Therefore, I do not have a set list of keys that I can iterate through.

Since I don't need to access them in any particular order, I could convert the dict keys into a list each time and then pop the first element. However, with 17,000 keys, building that list takes approximately 0.0005-0.0007 seconds per iteration, which is too slow for what I need. Is there a shortcut I could take so that I don't have to compile an enormous list of dict keys each time I want to select a single key?
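
A minimal sketch of the approach described above (the name my_dict and its contents are just placeholders):

my_dict = {i: i * i for i in range(17000)}   # stand-in for the real data
key = list(my_dict.keys())[0]                # building this list is the slow part
# ... use key, then possibly add or delete entries in my_dict ...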

Cœur
  • 37,241
  • 25
  • 195
  • 267
hannah
  • 889
  • 4
  • 13
  • 27
  • 5
    https://docs.python.org/3/library/stdtypes.html#dict.popitem – n1c9 Nov 29 '16 at 17:47
  • 8
    Have you considered `next(iter(dct))`? – vaultah Nov 29 '16 at 17:47
  • 2
    Here is a nice piece of code which does exactly this but with a O(1) time complexity. As you are worried about the time, you might be better using this - https://github.com/robtandy/randomdict – SRC Nov 29 '16 at 17:50
  • 2
    Can you explain the purpose? I suspect this is an [XY Problem](http://meta.stackexchange.com/a/66378/344593): What is the underlying problem you are trying to solve? – TemporalWolf Nov 29 '16 at 17:54
  • Use a set instead of a list, since set add/remove/contains operations are constant time O(1). – Idali Nov 29 '16 at 18:00
  • Use ``popitem`` as @n1c9 already suggested. If the item has to stay in the dict, simply re-add it before going on to modify the dictionary further. This should answer the requirements because the items in a dict (like a set) have no particular order so when popping one you basically get a fairly random one. – Irmen de Jong Nov 29 '16 at 18:04
  • Use a set instead of a list, since set add/remove/contains operations are constant time O(1). Use another collection for the loop and use the set in your logic to check whether a key exists and delete it when needed. Overall complexity for your algorithm/method is O(n) – Idali Nov 29 '16 at 18:07
  • @IrmendeJong When re-adding you have to do so only *after* the complete loop has ended, otherwise you can end up in an infinite loop... – Bakuriu Nov 29 '16 at 18:13
  • 2
    You've asked for a random item in the title, but your question body says you "don't need it to happen in any particular order". Can you edit to clarify whether any order is fine, or whether you specifically need a random order? Also, if you select a particular key and then don't immediately remove it, is it okay to select that key again (and again, and again)? Or do you need some guarantee about revisited keys and keys that never get visited? – user2357112 Nov 29 '16 at 18:13
  • @SRC that link should be an answer, it's the best answer by far as it maintains randomness while drastically improving performance. – TemporalWolf Nov 29 '16 at 19:10
  • I added some code to my answer. You can get all items from dict() containing 17000 items in about 0.297 secs. User doesn't even blink. Is that acceptable speed? – Dalen Nov 29 '16 at 21:46
  • @user2357112 I've clarified the title to match the message – Cœur Nov 30 '16 at 10:44

4 Answers

6

There are multiple ways, but you'll need to make some tradeoffs. One way is to empty the dictionary out using popitem; it is atomic and yields items in an arbitrary order, but it modifies the dictionary itself, so whatever item was selected is no longer in it. The next method that comes to mind is iterating as usual, even while modifying the dictionary; since the order of items might change, you could encounter an item any number of times. To track that, you could build a second set of visited keys. It's reasonably cheap to add keys to the set and to check whether each item is already in it, and when you've gone through the whole dictionary you can compare the set with the dictionary's keys to find any you missed (or removed). You do still end up building a set of keys, but only one item per iteration; in the pessimal case, the dictionary is modified in such a way that you scan through the whole set of visited items before finding a new one.
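
Rough sketches of both ideas, with a hypothetical handle() callback standing in for whatever work is done per key:

def drain(d, handle):
    # Idea 1: empty the dict with popitem(); each selected item is removed,
    # but keys added by handle() along the way are still picked up later.
    while d:
        key, value = d.popitem()
        handle(d, key, value)

def visit_all(d, handle):
    # Idea 2: leave the dict intact and track visited keys in a side set.
    # The next unvisited key is looked up fresh each time, so adding or
    # deleting keys between selections is safe.
    visited = set()
    while True:
        key = next((k for k in d if k not in visited), None)
        if key is None:
            return
        visited.add(key)
        handle(d, key, d[key])   # may add or delete keys in d

visit_all({"a": 1, "b": 2, "c": 3}, lambda d, k, v: None)   # d is left intact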

Is there a reason this data needs to be kept in a dictionary only? For instance, if we consider a system where we're shuffling songs, we might not want to visit the whole library but only place a limit on how recently a song has been played. That could be handled more efficiently using a list of songs from which we can read a random index, a set of recently played songs to avoid duplicates, and a queue (perhaps a list or deque) of songs that lets us update the set in order (removing the oldest entry each iteration). Bear in mind that references are reasonably cheap.
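
A sketch of that arrangement; the names and the window size are illustrative, not anything from the question:

import random
from collections import deque

songs = ["song%d" % n for n in range(100)]
recent = deque()            # plays in order, oldest on the left
recent_set = set()          # fast membership checks
WINDOW = 10                 # how recently a song may have been played

def play_random():
    while True:
        song = random.choice(songs)      # random index into the list
        if song not in recent_set:
            break
    recent.append(song)
    recent_set.add(song)
    if len(recent) > WINDOW:
        recent_set.discard(recent.popleft())   # drop the oldest entry
    return song

print(play_random())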

Rethinking one step further, we wouldn't need the set to check for duplicates if played songs simply aren't among the candidates: by swapping the oldest played song with the randomly selected next song, both the played and candidate lists stay a constant size, and no lookups are needed since each song is in only one of the lists.
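
And a sketch of the swap variant, again with illustrative names, assuming the data can live in plain lists:

import random
from collections import deque

def play_next(candidates, recently_played):
    # Pick a random candidate, then swap it with the oldest recently played
    # song; both containers keep a constant size and no membership lookups
    # are needed, because each song lives in exactly one of them.
    i = random.randrange(len(candidates))
    song = candidates[i]
    candidates[i] = recently_played.popleft()   # oldest song becomes selectable again
    recently_played.append(song)
    return song

songs = ["song%d" % n for n in range(100)]
recent = deque(songs[:10])   # seed the "recently played" window
pool = songs[10:]            # remaining candidates
print(play_next(pool, recent))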

Another idea is to use collections.ChainMap to keep a consistent view over two dictionaries: one holding items that have been visited and one holding those that have not. You could then migrate items from the latter to the former by way of popitem, giving a readable way of processing everything in the collection while keeping it dictionary-like.

def getnewitem(chainmap):
    # Move one arbitrary item from the unvisited dict (maps[0]) to the
    # visited dict (maps[1]); raises KeyError when maps[0] is empty.
    key, value = chainmap.maps[0].popitem()
    chainmap.maps[1][key] = value
    return key, value

As that means both dictionaries keep changing, it's likely not the fastest method overall, but it maintains both a dictionary-like collection and the ability to process every item. It does lose the ability to delete items directly through the ChainMap, since deletions only apply to the first mapping; you'd need to remove items from the backing dictionaries instead.
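
For illustration, a minimal usage sketch of the getnewitem() helper above (the dict contents are placeholders):

from collections import ChainMap

unvisited = {"a": 1, "b": 2, "c": 3}
visited = {}
view = ChainMap(unvisited, visited)   # lookups see both dicts

while True:
    try:
        key, value = getnewitem(view)
    except KeyError:                  # raised once unvisited is empty
        break
    print(key, value, len(view))      # len(view) stays 3 throughout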

Yann Vernier
  • 15,414
  • 2
  • 28
  • 26
  • if the item has to remain in the dict you could simply re-add it directly after the ``popitem()`` – Irmen de Jong Nov 29 '16 at 18:04
  • 1
    "The next method that comes to mind is iterating as usual, even while modifying the dictionary" - doesn't work. Even if the dictionary doesn't get rebuilt partway through, you'll probably trigger a "dictionary changed size during iteration" safety check. – user2357112 Nov 29 '16 at 18:10
  • You could, but the question is how strict the "selecting *another* one" requirement is. Calling popitem after you may or may not have done other modifications might produce the same item repeatedly. If we were transferring between two collections it might work well enough. – Yann Vernier Nov 29 '16 at 18:11
  • @user2357112: You're right. That means another cause for restarting the iteration, possibly turning the method pathologically bad. We're left with the two core questions of why the data is specifically in a dictionary, and what the order requirements are (like should all entries be selected eventually). – Yann Vernier Nov 29 '16 at 18:18
3

As SRC mentioned in the comments, the ideal solution is an indexed dictionary, which is available via randomdict:

Building a dict of 17,000 k,v pairs and running timeit:

import timeit

t = timeit.Timer(my_dict.random_item)
print t.repeat()

[2.3373830318450928, 2.284735918045044, 2.2462329864501953]

which gives about 2.2 μs per choice.

The other suggested answers are either not as fast, not random, or both.
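
For reference, a rough sketch of the indexed-dict technique itself; this is not randomdict's actual implementation or API, just the underlying idea of keeping the keys in a list plus a key-to-position map so selection is O(1):

import random

class IndexedDict(dict):
    # Minimal sketch: only __setitem__/__delitem__ are overridden, so other
    # mutating methods (update, pop, ...) would need handling in real use.
    def __init__(self, *args, **kwargs):
        super(IndexedDict, self).__init__(*args, **kwargs)
        self._keys = list(self)
        self._pos = {k: i for i, k in enumerate(self._keys)}

    def __setitem__(self, key, value):
        if key not in self:
            self._pos[key] = len(self._keys)
            self._keys.append(key)
        super(IndexedDict, self).__setitem__(key, value)

    def __delitem__(self, key):
        super(IndexedDict, self).__delitem__(key)
        i = self._pos.pop(key)
        last = self._keys.pop()
        if last != key:                  # fill the hole with the last key
            self._keys[i] = last
            self._pos[last] = i

    def random_item(self):
        key = random.choice(self._keys)  # O(1), no list rebuild per call
        return key, self[key]

d = IndexedDict((i, i * i) for i in range(17000))
print(d.random_item())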

TemporalWolf
  • 7,727
  • 1
  • 30
  • 50
2

Thank you, vaultah! You proposed:

next(iter(dict))

This takes approximately 0.00003 seconds, reducing time by a bit more than a factor of 10, and therefore works as fast as I need it to.

n1c9, you also made an interesting suggestion of:

dict.popitem()

This is a method I hadn't known about before, but unfortunately it takes 0.0002 seconds, which is not much of an improvement over my initial time.
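
For anyone who wants to reproduce the comparison, a small self-contained timing sketch (Python 3; the absolute numbers will vary by machine):

import timeit

d = {i: i * i for i in range(17000)}

# One arbitrary key without building the full key list:
print(timeit.timeit(lambda: next(iter(d)), number=100000))

# The original approach: build the whole key list just to take one element:
print(timeit.timeit(lambda: list(d)[0], number=1000))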

demongolem
  • 9,474
  • 36
  • 90
  • 105
hannah
  • 889
  • 4
  • 13
  • 27
  • 1
    it's worth noting there is no guarantee of randomness when using either of those methods. – TemporalWolf Nov 29 '16 at 18:12
  • 4
    Your timings are not fair to `popitem`... keep in mind that it also *removes the element* which means that `popitem` does a lookup *and* an update, while `next(iter(dct))` only does the lookup, but if you have to remove that element you end up paying also the removal operation! If *most* of the elements end up being removed then `popitem` should be faster overall, if only a few then `popitem` would make too much modifications. The point is: don't just profile the single operation, profile the whole loop! – Bakuriu Nov 29 '16 at 18:16
  • It's also worth mentioning a full solution (subclassing dict to add an unordered index of keys which can be `random.choice`'d in `O(1)` to this takes `print timeit.timeit("lambda: random.choice(dog_dict)", number=1) >>> 0.0000031s`, (where dog_dict contains 17000 k,v pairs) which is a full order of magnitude faster, although insertions and deletions become slightly more costly, as you have to maintain the index – TemporalWolf Nov 29 '16 at 18:51
  • It may also be worth noting that I am consistently seeing a small but significant (~5%) improvement of `next(d.iterkeys())` over `next(iter(dict))`. For `d = {k: k * k for k in xrange(17000)}`, I'm using `%timeit next(d.iterkeys())` (120–122 ns per loop) and `%timeit next(iter(d))` (129 ns per loop). This is Python 2.7.6 with iPython 5.1.0's timeit interface. – wchargin Nov 29 '16 at 21:39
  • In Python 3.6+ (CPython to be precise), dictionaries will be ordered, so this solution will give you the keys/items in the same order (reversed with popitem) in which you inserted them. – skrx Dec 06 '16 at 13:09
  • @skrx: currently the fact that dicts are ordered is an implementation detail, so it should not be completely relied upon (see https://stackoverflow.com/questions/39980323/). There are indications that this may become official in a later version of Python (e.g. https://twitter.com/raymondh/status/850102884972675072, but note that the new alpha 3.7 release notes don't mention this https://docs.python.org/3.7/whatsnew/3.7.html). – Tom Church Aug 18 '17 at 19:44
  • 1
    @TomChurch I didn't say you should rely on dicts being ordered. What I wanted to say is that the solution in this answer cannot be used in the latest Python versions, because the dicts are ordered. This may change again in the future, but that doesn't mean this solution should be used in that case, exactly because it's an implementation detail. – skrx Aug 19 '17 at 01:28
0

As a dict() is ordered according to the internal hashes used for fast access, and not by the order in which you added elements to it, you can take it to be random enough and use dict.popitem().

But popitem() will also remove this element from the dictionary. So you may wish to use:

d = {...}
keys = d.keys()      # Python 2: a plain list of the keys
item = keys.pop(0)   # removes from this list only, not from d
value = d[item]

instead. However, note that any dict with same/similar keys may have the same order of keys.

If you want a properly random selection, then do:

import random
d = {"red": "#ff0000", "green": "#00ff00", "blue": "#0000ff", "black": "#000000", "white": "#ffffff"}
keys = d.keys()
item = random.choice(keys)
value = d[item]

Of course, if you wish to prevent repetition, you will have to use an additional dict() and a while loop:

import random

d = {"red": "#ff0000", "green": "#00ff00", "blue": "#0000ff", "black": "#000000", "white": "#ffffff"}
keys = d.keys()   # Python 2: a plain list of keys
used = {}         # keys that have already been returned

def get_rand_item(d):
    # Note: this loops forever once every key has been used.
    item = random.choice(keys)
    while item in used:
        item = random.choice(keys)
    value = d[item]
    used[item] = None
    return item, value

get_rand_item(d)

Here I use a dict for the used storage because membership checks in a dict are faster than searching a list.

You asked for the fastest way. :D

While I am at it, let's see if I can come up with an even faster way of getting random items without repetition:

from random import choice

class RandomGetter:
    def __init__ (self, d, adapt=1):
        self.keys = keys = d.keys()
        if adapt:
            # Could be done in place too
            dct = {}
            for k in keys:
                dct[k] = (d[k], 0)
            self.dct = dct
            self.count = 1
        else:
            self.dct = d
            # Assume all items have been visited
            self.count = d[keys[0]][1]+1
        self.visited = 0
        self.length = len(self.dct)

    def __len__ (self):
        return self.length

    def randitem (self, default=None):
        if self.visited==self.length:
            # After 'default' is returned (all items gotten),
            # same RandomGetter() can be used again:
            self.count += 1
            self.visited = 0
            return default
        d  = self.dct
        kz = self.keys
        c  = self.count
        key = choice(kz)
        value, flag = d[key]
        while flag>=c:
            key = choice(kz)
            value, flag = d[key]
        d[key] = (value, flag+1)
        self.visited += 1
        return key, value

    def next (self):
        i = self.randitem()
        if i is None: raise StopIteration
        return i

    def __iter__ (self):
        while 1: yield self.next()

# Now testing:
# Lets create a dictionary of one million items:
d = dict.fromkeys(xrange(1000000))
# This takes about 0.128 seconds
# Now, let's initialize the RandomGetter
r = RandomGetter(d)
# If the dict is not prepared in advance, as this one isn't, we use adapt=1 and it takes
# about 8.92 seconds. Yuck!
# Now measure time for each random getting:
from time import time
def check ():
    randitem = r.randitem # Faster access to the method
    e = []
    for _ in xrange(len(r)):
        t = time()
        randitem()
        e.append(time()-t)
    print "Total/min/max/med/avg/(0 time count)"
    e.sort()
    s = sum(e)
    if len(r)%2==0: m = (e[len(r)/2-1]+e[len(r)/2])/2
    else: m = e[len(r)/2]
    print s, e[0], e[-1], m, ("%.15f" % (s/len(r))), e.count(0.0)
check()
# It yields the following results on my machine:
# About 25.224 seconds to randomly get all 1000000 items
# The minimal time needed is not measurable with this technique, so it is 0.0
# The maximal time needed to get an item is about 1.678 seconds
# The median is 0.0, so we know that more than half of the randomly gotten items took practically no time
# In fact, there are about 998808 items with a getting time of 0.0 seconds
# The average getting time is about 0.000025224 seconds
# By examining the results closely I deduced that only the last few items took a long time to get.
# All in all, not bad for one million items, in my opinion anyway.
# For a dict of 2000 items, the total time was 0.016 seconds; that was also the maximal value, and it was for the last gotten item
# Time didn't cross one second until the dictionary was bigger than 100000 items
# If you want, you can run my code through timeit to recheck, but it seems that it is faster
# than the suggested random dictionary.

Dalen
  • 4,128
  • 1
  • 17
  • 35
  • Note that `.pop(0)` only works in python2... in python3 `keys` return a `set`-like object which is not ordered.. – Bakuriu Nov 29 '16 at 18:14
  • Well then just use item = keys[0]; del keys[0]. But in that case, and in the case of pop() too, it is better to use the last element, keys[-1]; it'll be faster. Lists aren't meant to support fast editing, but fast random memory access. – Dalen Nov 29 '16 at 18:24
  • The keys() method builds the large list OP complained about (or list(d.keys()) in Python 3), and on top of that list.pop(0) is O(n). list.pop() will take the last item, which is faster as the others don't need to move. As the used set grows, repeated random choices to find the last items not in it grows expensive. – Yann Vernier Nov 29 '16 at 18:27
  • Why would you even bother with removing the element from the keys list? You're just going to throw that list away. – user2357112 Nov 29 '16 at 18:31
  • @YannVernier : Yes, but we create the keys list only once, and then we are free to use its fast memory access to get to a random key. If we want real randomness on a really big dictionary, and we want it to work fast, well, that's my best bet. Get all the keys, and then the only time-consuming operations are random.choice() and dict.get(). – Dalen Nov 29 '16 at 18:49
  • If we want to prevent repetitions, there is nothing for it but to keep track of already gotten items. This will not impede speed too much, as random.choice() will rarely repeat itself, though checking whether an element has been used will consume some additional time, and speed will decrease toward the end of choosing random elements. But if you have a better idea, tell us. – Dalen Nov 29 '16 at 18:52
  • If we're doing the batch approach we could also random.shuffle() the list in the first place, eliminating the need for the used set. – Yann Vernier Nov 29 '16 at 18:54
  • Yeah, I just thought of it and went to recheck the docs, but you were faster. Well, random.shuffle() will modify the list in place and its complexity is O(N). Then we would have to use list.pop() to get the last element. Hm, first getting the large list of keys, then shuffling them: how long will that take? All in all, it should afterwards be faster than rechecking whether an item was taken or not. The fastest thing with my solution would be for the original dictionary to contain, in each item, a flag counting the number of visits. – Dalen Nov 29 '16 at 19:12
  • @YannVernier : Here, I used the flagging idea and it works fine. See my edit. I wasn't expecting very good results; it surprised me. – Dalen Nov 29 '16 at 21:33