8

I know that you can select a random value from a dictionary in several ways.

In Python 2:

random.choice(d.keys())

In Python 3:

random.choice(list(d.keys()))

Nonetheless, both approaches require a transformation to a list before the random selection, which takes linear time O(n). For example, I know that in Python 3 d.keys() returns a view object rather than a list, and I am guessing that in Python 2 the list is created internally from the dictionary.

Is it possible to select a value from a dictionary in constant time, i.e. O(1)?

EDIT: From the comments so far, I think that it is not possible, at least not in a straightforward fashion. Auxiliary structures are required.

EDIT 2: I thought that the dictionary could support a random choice in constant time since internally it is a hash table, i.e. internally it has to have an array or something similar. Of course, it depends on the internal implementation, but theoretically I think it is possible.

toto_tico
  • 17,977
  • 9
  • 97
  • 116
  • What exactly is your use case, and does the standard solution (calling `dict.keys()` and choosing a random value from it) not perform well enough for you? – wkl Sep 26 '15 at 23:13
  • I am building a computer simulation, so I would need to call `dict.keys()` many many times. I am quite sure that it takes O(n) in Python 3 because `dict.keys()` returns an iterator. I don't think the situation is any better in Python 2, which probably converts the dict internally. – toto_tico Sep 26 '15 at 23:17
  • I think for your use case, you'd be better off assigning keys as a pre-defined "prefix" plus an incrementing "value", i.e. "A1", "A2", "B1", "B2"...; then to randomly access the keys, you only need to pick a random prefix and, in O(1), a random value up to the length of that "value" range, which I think is what you want. – Anzel Sep 26 '15 at 23:24
  • @Anzel, the problem is that there is constant deletions and additions to the dictionary so I cannot keep consistent keys. – toto_tico Sep 26 '15 at 23:34
  • There are some other posts that have a similar question, but I think you have constraints that make many other possible solutions (like maintaining a separate data structure that has quicker indexing) suboptimal. A dictionary might not be the right way, or perhaps the approach is not the right way for your needs. I'm not going to vote for closure due to 'broadness' or anything, but your problem seems much more difficult than it appears on the surface. You'll probably need to describe your problem a lot more to get more help. – wkl Sep 26 '15 at 23:41
  • @birryree, yes, my problem is much more difficult. The answer seems to be that it is not possible, but the question is still valid. It seems that I need to keep an extra structure updated (which in my case would be painful), so that I don't have to iterate the dictionary or transform it into a list. – toto_tico Sep 26 '15 at 23:58
  • @toto_tico: I second birryree, you should at least state which operations (dictionary lookup, deleting from dictionary, random key choice — anything else?) your data structure must support and how often do you use each of them. – firegurafiku Sep 27 '15 at 00:24
  • @firegurafiku, I am replicating [this model](https://mitpress.mit.edu/sites/default/files/titles/content/ecal2015/ch026.html). I probably need all the operations, and I have been using auxiliary structures to improve efficiency (also numpy). I am also using networkx, and I was calling the method [nodes()](http://networkx.readthedocs.org/en/stable/_modules/networkx/classes/graph.html#Graph.nodes), which I thought was O(1) but then realized was transforming the dictionary into a list. A recent version of my code is [here](https://github.com/robertour/miller-knowles/) as well. – toto_tico Sep 27 '15 at 00:41
  • Does the dict keep changing or why would you need to keep calling .keys? – Padraic Cunningham Sep 27 '15 at 01:08
  • @PadraicCunningham, yes, it changes many times. In fact, it changes after each random choice. In reality, it is based on a tournament (it selects several items and then looks for the winner, which is removed from the dict), so it is slightly more complicated. – toto_tico Sep 27 '15 at 01:10
  • Can the dict grow or only lose keys as the tournament progresses? – Padraic Cunningham Sep 27 '15 at 01:18
  • @PadraicCunningham it grows and shrinks. Please see [here](https://mitpress.mit.edu/sites/default/files/titles/content/ecal2015/ch026.html) – toto_tico Sep 27 '15 at 01:22
  • Even if you could access the underlying array you would have problems implementing a uniform sample I think. The Python hash tables do not try to spread the values uniformly through the table. – strubbly Sep 27 '15 at 14:46
  • 1
    Weirdly there is this duplicate http://stackoverflow.com/questions/10840901/python-get-random-key-in-a-dictionary-in-o1 with completely different answers. – strubbly Sep 27 '15 at 15:26
  • @toto_tico, judging by the additional constraints you have stated, a dictionary probably is *NOT* the right tool for this case. Could you perhaps add a small sample to reflect your actual use case? We may be able to help find a more efficient alternative. – Anzel Sep 28 '15 at 11:10

4 Answers

1

There is only one kind of (minor) optimization I can imagine in this situation: do not create a list; just get a random number r and iterate over d.keys() until you reach the r-th item.

import random

def take_nth(sequence, n):
    # Advance an iterator n times and return the next (i.e. the n-th) item.
    i = iter(sequence)
    for _ in range(n):
        next(i)
    return next(i)

# Pick a random value directly, without materializing the key list.
rand_value = d[take_nth(d.keys(), random.randint(0, len(d) - 1))]

This would give you slightly better performance, because you wouldn't have to build the whole list each time (on average you only advance through half of the keys), but it is still linear time and therefore still a bad idea for repeated use.

If you want to do that random selection repeatedly over a fixed dictionary, then just cache its keys into a separate list and index it with a random index value.
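
For a dictionary that does not change, that caching idea is straightforward; a minimal sketch (the names here are illustrative, not part of the original discussion):

import random

d = {'a': 1, 'b': 2, 'c': 3}   # the fixed dictionary
cached_keys = list(d.keys())   # O(n), but paid only once

def random_value():
    # O(1) per call: index the cached key list at a random position.
    return d[cached_keys[random.randrange(len(cached_keys))]]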

UPD:

To sum up the discussion in the comments, the following class with forward/backward caching and reuse of deleted slots may be helpful:

import random

class RandomSampleDict(object):

    def __init__(self):
        self.data     = {}   # key -> value
        self.cache_ik = {}   # slot index -> key
        self.cache_ki = {}   # key -> slot index
        self.track    = []   # freed slot indices, reused by add()

    def lookup(self, key):
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value

    def add(self, key, value):
        self.data[key] = value
        if len(self.track) == 0:
            i = len(self.data) - 1   # no free slot, so open a new one
        else:
            i = self.track.pop()     # reuse a previously freed slot

        self.cache_ik[i] = key
        self.cache_ki[key] = i

    def delete(self, key):
        del self.data[key]
        i = self.cache_ki[key]
        del self.cache_ik[i]
        del self.cache_ki[key]

        self.track.append(i)

    def random_sample_key(self):
        # Rejection sampling: slot indices currently in use always lie in
        # [0, len(data) + len(track) - 1], so retry until a used slot is hit.
        while True:
            i = random.randint(0, len(self.data) + len(self.track) - 1)
            if i in self.cache_ik:
                return self.cache_ik[i]
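
A quick usage sketch for the class above (the keys and values are just placeholders):

rsd = RandomSampleDict()
rsd.add('alice', 1)
rsd.add('bob', 2)
rsd.add('carol', 3)
rsd.delete('bob')

key = rsd.random_sample_key()    # 'alice' or 'carol', chosen uniformly
print(key, rsd.lookup(key))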
firegurafiku
  • 3,017
  • 1
  • 28
  • 37
  • 1
    Forgot a closing paren, should be d[take_nth(d.keys(), random.randint(0, len(d)-1))] –  Sep 26 '15 at 23:37
  • @firegurafiku, I guess this will give me better performance on average. Thanks. I will try this for now. I was thinking of maintaining a tuple or a list on the side, but apparently removing elements is O(n) (https://wiki.python.org/moin/TimeComplexity), so I might end up in an even worse position. Apparently there are no linked lists in standard Python (http://stackoverflow.com/questions/6153348/time-complexity-of-tuple-in-python) – toto_tico Sep 26 '15 at 23:40
  • 1
    @toto_tico: You don't have to *remove* items from your cache list, just *mark* them as `None` on removal and repeat `randint` if necessary. This will work if you don't have to eventually remove the majority of items. – firegurafiku Sep 26 '15 at 23:43
  • @firegurafiku, I actually have to delete a lot of items as well. It is basically a graph that constantly expands and shrinks. – toto_tico Sep 26 '15 at 23:45
  • @toto_tico: If your dictionary undergoes a kind of pulsation with its size staying approximately the same, then you can keep track of the `None`-marked cache items (in a separate list acting like a stack) and reuse them as far as possible. – firegurafiku Sep 27 '15 at 00:04
  • @firegurafiku, I can grasp what you are saying but it might not work on my problem. The removed nodes shouldn't be part of the random choice. Could you expand on this idea? Maybe I am missing something. – toto_tico Sep 27 '15 at 00:23
  • 1
    @toto_tico: I meant something like the following (just wrote to illustrate an idea, not tested): https://gist.github.com/firegurafiku/1a10e7335e2e81e46883 – firegurafiku Sep 27 '15 at 00:48
  • @firegurafiku, yes, I am working on keeping a reverse index. I actually have one (that I use for another issue) but it is a fixed numpy array (so it is not optimal); I will test it first, and if that fails, I think your approach is the general solution. Please post the link as an answer. Others might find it useful. – toto_tico Sep 27 '15 at 01:04
  • I just wanted to add that I thought the dictionary could support a random choice in constant time since internally it is a hash table, i.e. an array. Of course, it depends on the internal implementation, but theoretically I think it is possible. I will edit my question to add this. – toto_tico Sep 27 '15 at 01:08
  • @toto_tico: If you think the link would be useful... I'd test if it works and post the code here too. – firegurafiku Sep 27 '15 at 01:11
  • @toto_tico: I don't think there is a way to access internal hash table array, unless you're working with C API. If the problem really matters for your research, you may want to write a C module which provides a function like `dict_hashtable`. – firegurafiku Sep 27 '15 at 01:15
  • @firegurafiku I agree, I was suggesting that theoretically it should be possible to implement, as in improving the Python API - but I didn't feel confident enough to suggest it explicitly. I am still a newbie that came from Java. – toto_tico Sep 27 '15 at 01:27
  • 1
    @toto_tico Another approach may be to simply roll out your own hash table implementation in pure Python. If your hashes are big enough, they would outperform built-in dictionaries due to copy elimination. – firegurafiku Sep 27 '15 at 01:39
  • @firegurafiku The reverse index seems to have worked. I managed to reduce the times by ~50% (with a few other optimizations here and there). Thanks! – toto_tico Sep 27 '15 at 14:04
  • @toto_tico: You're welcome. Feel free to edit my answer's code (or post your own) if you have introduced major improvements. – firegurafiku Sep 28 '15 at 05:45
1

next(islice(d.values(),np.random.randint(0, len(d)-1),None)) is the best performing method I've found to select a random value from dict d in Python 3. This is explained in the following discussion.

Some standard library random methods take much more run time than comparable numpy.random methods. For example:

import numpy as np

timeit random.randint(0, 10)
100000 loops, best of 3: 2.52 µs per loop

timeit np.random.randint(0, 10)
1000000 loops, best of 3: 453 ns per loop

Using numpy.random.randint can improve the runtime of methods for selecting a random value from a dict:

from itertools import islice
import random

d = {1:'a',2:'b',3:'c',4:'d',5:'e',6:'f',7:'g',8:'h',9:'i',10:'j'}

timeit next(islice(d.values(),random.randint(0, len(d)-1),None))
100000 loops, best of 3: 3.58 µs per loop

timeit next(islice(d.values(),np.random.randint(0, len(d)-1),None))
100000 loops, best of 3: 1.26 µs per loop

# d[5] access time is about 25X smaller than 1.26 µs
timeit d[5]
10000000 loops, best of 3: 51.3 ns per loop

def take_nth(sequence, n):
    i = iter(sequence)
    for _ in range(n):
        next(i)
    return next(i)

timeit d[take_nth(d.keys(), random.randint(0, len(d)-1))]
100000 loops, best of 3: 5.07 µs per loop

timeit d[take_nth(d.keys(), np.random.randint(0, len(d)-1))]
100000 loops, best of 3: 2.66 µs per loop
  • How did you manage to run that, `random.sample` doesn't seem to accept dictionaries: `TypeError: Population must be a sequence or set. For dicts, use list(d).` – toto_tico Sep 26 '15 at 23:53
  • It works for Python 2 and I have qualified the answer in that regard. I use Python 2 since it's widely used with numpy and libraries that depend on it, such as pandas, which makes compatibility with existing scripts and tutorials easier. Also, I have noticed when comparing benchmarks that Python 2 usually runs faster than Python 3, sometimes by as much as 50%, for example see http://stackoverflow.com/questions/32781288/looking-for-a-number-with-specific-conditions-in-python/32784496#32784496. –  Sep 27 '15 at 00:03
  • 3
    Python 3 works just as well with the numeric stack. In any case, `random.sample` in 2 doesn't do anything magical -- it simply calls `list(population)` inside in this case, and so isn't any different from the OP calling it directly. – DSM Sep 27 '15 at 00:14
  • @DSM, that is what I thought. I am also using numpy, scipy on the side to keep some structures, and I am about to use them as well to solve this problem in a more manual (i.e. coffee-based) way. Thanks, anyway. At least, I am sure that there was no free approach for me today. – toto_tico Sep 27 '15 at 00:17
  • @DSM: for whatever reason, random.sample(d,1) runs about 14% faster than random.sample(list(d),1) today, where d = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j'}, based on my timeit tests using Anaconda Python 2.7.10 x64 on Windows 7 x64. –  Sep 27 '15 at 01:10
1

It is clear, I think, that this is not possible through the standard dict public API.

However, there are several drop-in replacements for dict which provide efficient access to the keys in some sorted order. This can then be indexed to obtain a random element. Although their theoretical asymptotics are not identical to dict, in practice they generally perform as well or better.

The blist package from Stutzbach Enterprises provides blist.sorteddict, which is specifically tested to be completely compatible with dict. It provides indexing into its key view with logarithmic complexity. It is implemented with B+ trees.

The SortedContainers package from Grant Jenks provides sortedcontainers.SortedDict which similarly provides efficient indexing of its key view.

Others are also available, typically based on search trees.
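
As a rough illustration of the indexing approach, here is a sketch using SortedContainers (it assumes the SortedDict key view supports integer indexing, as its documentation describes):

import random
from sortedcontainers import SortedDict

sd = SortedDict({'a': 1, 'b': 2, 'c': 3, 'd': 4})

# Pick a random key by indexing the sorted key view, then look up its value.
i = random.randrange(len(sd))
key = sd.keys()[i]
value = sd[key]
print(key, value)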

strubbly
  • 3,347
  • 3
  • 24
  • 36
  • I don't understand why somebody downvoted this answer. This might actually be an alternative. The doc says that access by index is O(1) when the dictionary's size hasn't changed recently, otherwise O(log n). My dict changes quite often, but relatively speaking not so often. It might work. – toto_tico Sep 27 '15 at 13:58
  • 1
    ... or maybe not: it has O(log n) key removal, and if I consider the constant factors of maintaining trees, this might not be the best idea after all. Plus, I forgot I am using the networkx implementation, so I would have to keep two structures. Anyway, thanks! I think somebody might find the idea useful. – toto_tico Sep 27 '15 at 14:01
  • Two structures is a problem, but I wouldn't worry about the O(log n) removal. The theoretical difference between O(1) and O(log n) is often swamped by other considerations - the benchmarks for both of these container classes are certainly comparable to dict. – strubbly Sep 27 '15 at 14:41
  • I would agree for really big structures, but the problem is that the structure is not so huge (1000 nodes), so the constant factors of maintaining trees might really kill the benefits. Though I might try blist as an auxiliary structure; with O(log n) removals and additions (O(1) when size changes are not recent) it might be my best bet. – toto_tico Sep 27 '15 at 15:57
  • 1
    The many factors involved (eg processor cache, hash collisions etc) mean that the only way to be sure is to try it. Maybe try the sortedcontainer package too: the benchmarks here make it look competitive: http://www.grantjenks.com/docs/sortedcontainers/performance.html Especially because I suspect the fixed overheads are lower (since it is based on lists). – strubbly Sep 27 '15 at 18:58
1

Under the assumption that this can't be done using the Python dict alone and that a second data structure is required, here's a cheap and efficient secondary data structure which just tracks the current nodes.

It just keeps the nodes in a list. To support deletion, it simply empties the slot and keeps another list of free locations.

Note that if you only delete nodes randomly then this is fine as it stands. If you want to delete nodes that are chosen by some other method, then you'll need to store the sequence numbers in the nodes, so that you can find them to delete.

It works well unless you get into the situation where the nodes list becomes mostly empty, at which point the random sampling becomes slow. If you need to handle that situation, then you'll need to reallocate the list at that point - which is OK as an amortised cost but adds quite a bit of complication. For example, you'll need to add a dictionary from nodes to sequence numbers and update it when you reallocate the nodes list.

import random
RNG = random.Random()

class Tracker(object):

    def __init__(self):
        self.free = []    # sequence numbers of empty slots in self.nodes
        self.nodes = []   # current nodes; deleted slots hold None

    def add(self, node):
        if self.free:
            # Reuse a previously freed slot.
            seq_num = self.free.pop()
            self.nodes[seq_num] = node
        else:
            # No free slot, so append a new one.
            seq_num = len(self.nodes)
            self.nodes.append(node)

    def random_node(self):
        # Rejection sampling: retry until a non-empty slot is hit.
        seq_num = RNG.randint(0, len(self.nodes) - 1)
        while self.nodes[seq_num] is None:
            seq_num = RNG.randint(0, len(self.nodes) - 1)
        return self.nodes[seq_num], seq_num

    def delete(self, seq_num):
        self.nodes[seq_num] = None
        self.free.append(seq_num)

    def delete_random_node(self):
        node, seq_num = self.random_node()
        self.delete(seq_num)
        return node
There may be some small optimisations available here. Replacing the free list with a collections.deque might make it a little faster, because lists slow down a little if their size changes too often. But it's no big deal. I think your nodes list will hit an equilibrium size and then become very efficient, but you could pad it out with Nones to start with to avoid the start-up cost of repeatedly growing. You could also do a little common sub-expression elimination. But all of these will have only small effects.
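
For instance, the free-list swap mentioned above might look like this (a sketch of just that one change, building on the Tracker class above, not benchmarked):

import collections

class DequeTracker(Tracker):
    # Same structure as Tracker, but the free list is a deque; the inherited
    # add/delete methods still work since deque supports append() and pop().
    def __init__(self):
        self.free = collections.deque()   # sequence numbers of empty slots
        self.nodes = []                   # current nodes; deleted slots hold None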

strubbly
  • 3,347
  • 3
  • 24
  • 36
  • This is definitely worth a try. My problem would be that I need to pick more than one node at random (e.g. random.sample instead of random.choice). The simulation is set to 2.5% of the nodes, but it is one of the factors of analysis. The simulation has many details, and if I tried to explain all of them I would confuse the issue. For full details, the article is [here](https://mitpress.mit.edu/sites/default/files/titles/content/ecal2015/ch026.html). – toto_tico Sep 28 '15 at 16:58
  • I think you could just call `random_node` the right number of times and check for dupes. The number of nodes is `len(self.nodes) - len(self.free)`. That will work fine unless there are too many free slots. – strubbly Sep 28 '15 at 21:57