I'm looking for a faster way of sampling a single element at random from a large Python set. Below I've benchmarked three obvious approaches. Is there a faster way of doing this?

import random
import time

test_set = set(["".join(["elem-", str(l)]) for l in range(0, 1000000)])

t0 = time.time()
random_element = random.choice(list(test_set))
print(time.time() - t0)

t0 = time.time()
random_element = random.sample(test_set, 1)[0]
print(time.time() - t0)

t0 = time.time()
rand_idx = random.randrange(len(test_set))
random_element = list(test_set)[rand_idx]
print(time.time() - t0)

Output:

0.0692291259765625
0.06741929054260254
0.07094502449035645
Danny Friar
  • ```set()``` is probably a really bad data-structure for this task (compared to something obvious like a tree where you can follow a random-path in log(n) time)! See also [this discussion](https://stackoverflow.com/questions/124671/picking-a-random-element-from-a-set) (which is not about python) – sascha Jul 07 '17 at 13:16
  • Agreed. Although in what I'm doing there is a previous operation which is much faster with sets, so I'm stuck with it. – Danny Friar Jul 07 '17 at 13:23
  • Well if you really want to optimize it, much more info, especially about access patterns, is needed. What are you doing before and how much time does it cost? How often do you want to sample here and what is the current time-ratio? And so on... (side-question: did you really check search trees? I would expect them to be as fast as hashing for this size; but I'm only guessing; you would probably still need a customized sampler in this case) – sascha Jul 07 '17 at 13:27
  • In [this related question](https://stackoverflow.com/questions/15837729/random-choice-from-set-python), `random.choice(tuple(test_set))` was basically the consensus "winner". (Note that creating tuples is slightly faster than creating lists.) This was also discussed at length among the Python core developers on [python-ideas] in April 2016, with the same conclusion. Also note that if you just want an *arbitrary* element rather than a *random* one, [you can go faster](https://stackoverflow.com/questions/59825/how-to-retrieve-an-element-from-a-set-without-removing-it). – John Y Jul 07 '17 at 17:53
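
A minimal sketch of the `random.choice(tuple(test_set))` approach from John Y's comment above; the tuple conversion still touches every element, so each draw remains O(n), just with slightly less overhead than building a list:

import random

test_set = set("elem-" + str(i) for i in range(1000000))

# tuple() is marginally cheaper to build than list(), but the conversion
# still copies every element, so each individual draw remains O(n)
random_element = random.choice(tuple(test_set))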

2 Answers


You could use numpy and add it to your benchmarks.

import numpy
random_num = numpy.random.randint(0, 1000000)
element = 'elem-' + str(random_num)
test_array = numpy.array([x for x in test_set])

Specifically, this is a piece of code that benchmarks the different methods:

random_choice_times = []
random_sample_times = []
random_randrange_times = []
numpy_choice_times = []
for i in range(0,10):
    t0 = time.time()
    random_element = random.choice(list(test_set))
    time_elps = time.time() - t0
    random_choice_times.append(time_elps)

    t0 = time.time()
    random_element = random.sample(test_set, 1)[0]
    time_elps = time.time() - t0
    random_sample_times.append(time_elps)


    t0 = time.time()
    rand_idx = random.randrange(len(test_set))
    random_element = list(test_set)[rand_idx]
    time_elps = time.time() - t0
    random_randrange_times.append(time_elps)

    t0 = time.time()
    random_num = numpy.random.choice(numpy.array(test_array))
    time_elps = time.time() - t0
    numpy_choice_times.append(time_elps)


print("Avg time for random.choice: ", sum(random_choice_times) /10)
print("Avg time for random.sample: ", sum(random_sample_times) /10)
print("Avg time for random.randrange: ", sum(random_randrange_times) /10)
print("Avg time for numpy.choice: ", sum(numpy_choince_times) /10)

Here are the times:

>>> Avg time for random.choice:  0.06497154235839844
>>> Avg time for random.sample:  0.06054067611694336
>>> Avg time for random.randrange:  0.05938301086425781
>>> Avg time for numpy.choice:  0.017636775970458984
Diego Aguado
  • @kvorobiev why is this not an attempt to answer? – Petter Friberg Jul 07 '17 at 19:37
  • This really fails to answer the question. First, numpy is not the way to go here (a single sample is often slower than python's random), but much more important: all the approaches compared against actually work on the set itself, while the numpy approach does not and makes heavy assumptions which are probably not met in the task. Apart from that, if you allow this *cheating*, you can also cheat like that with python's random module and get comparable times, if not better (see the sketch after these comments). – sascha Jul 08 '17 at 13:04
  • @sascha considering the way the question is described, here are a few thoughts. a) Danny wants to sample only one element from the set. b) Some heavy assumptions I can think of are: uniform distribution among the elements of the set (not that crazy). c) Why is numpy not the way to go here? d) If you are worried about sampling from the set itself instead of using my approach, then the question should be reformulated, don't you agree? e) As you can see, python's random is not faster. – Diego Aguado Jul 10 '17 at 21:24
  • A: Maybe. So every approach transforming the data is slow (incl. np-arrays). B: Your assumption is: the set consists of gapless integers of the form ```elem-X``` in some range a,b. That's obviously only an example, and nearly every real-world case won't fit in regard to this *cheating*. C: Again, numpy needs data transforming (to sample just one element), and a single sample is not necessarily faster than python's random. D: Discussable... But your benchmark is very, very bad, as you compare things which are not doing the same work, without remarks! E: I see no python-random approach mimicking the cheat. – sascha Jul 10 '17 at 21:48
  • You tailored your NumPy solution to the *example*, not to the *problem*. What if the set is different? Let's say `import string; test_set = {x for x in range(500000)} | {y*y for y in range(500000, 1000000)} | {z for z in string.printable}`. Now write a NumPy solution to select from *that* set, and run the timings again. – John Y Jul 10 '17 at 22:12
  • OK, I see your point, I'll edit the answer for a more general case. In my defense, I didn't get that the elements in the example were merely for demo purposes. – Diego Aguado Jul 10 '17 at 23:04
  • The current version is still not benchmarking correctly given our assumptions. The numpy approach gets the set -> array conversion for free (not timed, since it is calculated earlier, outside the loop), and your np.array() on top of this does nothing (another bad thing). Either create the array on demand (as part of the timing), or let the other approaches use a pre-calculated list (from the set) too (so it's not timed). – sascha Jul 11 '17 at 00:03
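
As a sketch of the "cheating" sascha describes, the same pattern-based trick with the stdlib random module; like the NumPy snippet above, it assumes the elements are exactly the gapless strings elem-0 through elem-999999:

import random

# Exploits the known element pattern instead of touching the set at all;
# only valid because the example set is "elem-0" ... "elem-999999" with no gaps
random_element = "elem-" + str(random.randrange(1000000))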

You could try this.

from time import perf_counter

def random_set_ele(set_: set):
    copy = set_
    return copy.pop()

test_set = set(["".join(["elem-", str(l)]) for l in range(0, 1000000)])
start = perf_counter()
print(random_set_ele(test_set))
print(perf_counter()-start)

Result:

elem-57221
0.00016391400276916102

The .pop() method of a set extracts and returns a random element, removing it from the set. We make a copy of the set before popping so that the original set is not modified.

Jason Grace
  • 1. `copy = set_` doesn't copy. 2. `.pop()` extracts an arbitrary element, not a random one. – aaron Feb 01 '23 at 16:12
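
For reference, a minimal corrected sketch along the lines of aaron's comment; set.copy() gives a real copy, but pop() still returns an arbitrary (not uniformly random) element, and the copy makes every call O(n) in time and memory:

def random_set_ele(set_: set):
    copy = set_.copy()  # a real (shallow) copy; plain assignment only rebinds the name
    return copy.pop()   # removes and returns an *arbitrary* element, not a random one

print(random_set_ele({"elem-1", "elem-2", "elem-3"}))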