1

Say I want to draw 400 numbers every time from 50,800 numbers.

import time
import numpy as np


def random_int(original_list, num_to_draw):
    # original_list: from which numbers will be drawn
    # num_to_draw: numbers to draw
    drawn_numbers = np.random.choice(original_list, num_to_draw, replace=False)
    left_over_list = [i for i in original_list if i not in list(drawn_numbers)]
    return drawn_numbers, left_over_list

I made this custom function, and I'm going to iterate for 50800//400 times, which is 127.
But it takes little more than I expected.

%%time
lst = [i for i in range(50800)]
for i in range(50800//400):
    drawn, lst = random_int(lst, 400)

Wall time: 3min 57s

3min 57s was pretty long than I wanted, so I examined how much time does the each iterate take.

lst = [i for i in range(50800)]
for i in range(50800//400):
    start_time = time.time()
    drawn, lst = random_int(lst, 400)
    print(time.time() - start_time, end='\t')

and the parts of the results were,

3.3326096534729004  3.599494218826294   3.4855289459228516  4.159000396728516   4.07768440246582    3.308715343475342   3.601555347442627   3.70918869972229    3.6293230056762695  3.658910036087036   3.7334437370300293  ...

Interestingly, the time taken doesn't really reduce even the list which I'm drawing from is getting smaller and smaller.
Of course it DOES get smaller as iterates over in big sense, but not really consistently(need not linearly) as you can see above the parts of the results.

Is there any ready-made function or package that I can draw all elements exactly once, without replacement efficiently and randomly?

HyeonPhil Youn
  • 428
  • 4
  • 11
  • This is just `random.sample` – user202729 Dec 30 '21 at 03:42
  • See [What does random.sample() method in python do? - Stack Overflow](https://stackoverflow.com/questions/22741319/what-does-random-sample-method-in-python-do) for an explanation. – user202729 Dec 30 '21 at 04:03
  • @user202729 `random.sample` does draw the samples from the list I want, but the main difference is that I cannot iterate over till list exhausts. – HyeonPhil Youn Dec 30 '21 at 14:35

3 Answers3

2

Maybe I misunderstood the question, but it seems to me you can just shuffle the list and draw elements from it, just like you would from a deck of cards.

lst = [i for i in range(50800)]
import time
import random

#shuffle once...
random.shuffle(lst)

def random_int(lst, size):
    drawn, lst = lst[0:size],lst[size:]
    return drawn, lst

start = time.time()
for i in range(50800//400): 
    start_time = time.time()
    drawn, lst = random_int(lst, 400)
    print(time.time() - start_time)

print(f"total={time.time() - start}. {len(drawn)=} {drawn=}")

output:

0.00038909912109375
0.00016999244689941406
0.00018596649169921875
0.0001678466796875
....
6.9141387939453125e-06
6.198883056640625e-06

total=0.011039972305297852. len(drawn)=400 drawn=[36545, 8918, 11276, 18128, 25933, 11644, 1343, 18747, 25229, 15508, 15951, 15182, 28579, 5691, 6205, 10391, 46056, 48219, 7091, 2079, 31694, 20368, 39585, 31945, 37282, 20769, 8921, 36603, 8506, 20351, 49899, 23076, 10146, 4636, 10875, 13229, 32614, 39371, 40984, 32491, 10703, 49320, 12466, 19098, 15324, 27832, 8950, 37885, 2100, 27080, 2970, 32348, 29360, 11823, 21113, 19874, 1183, 24157, 4774, 42503, 42971, 47751, 22417, 29806, 33764, 4880, 40221, 16716, 14938, 14547, 26871, 29769, 10430, 5545, 26288, 29787, 9738, 2420, 4675, 47397, 11819, 46136, 11588, 18584, 48098, 35012, 7742, 28432, 11070, 13269, 15983, 33174, 1872, 49543, 38283, 31872, 8632, 11087, 27565, 47852, 28418, 47115, 47570, 29122, 4792, 3903, 501, 47356, 50785, 23959, 39413, 5137, 22144, 25088, 45446, 12054, 37566, 31654, 1275, 45083, 23774, 43232, 5556, 16429, 9487, 39543, 25642, 25766, 14752, 26075, 47844, 12150, 41891, 46581, 25817, 31309, 35541, 11687, 4602, 44639, 40594, 3351, 14904, 13792, 7313, 37482, 49423, 25311, 32915, 34544, 34203, 50199, 23561, 27484, 36205, 28487, 41661, 35203, 30768, 4556, 8523, 32634, 44687, 28297, 20802, 27721, 12454, 46949, 18045, 49482, 19662, 42442, 18264, 39573, 10094, 3424, 34928, 13293, 6102, 40003, 35410, 28582, 24351, 16615, 30681, 40967, 42700, 6072, 22153, 9564, 14293, 50749, 21510, 4151, 21497, 23424, 20226, 36592, 759, 32050, 50504, 21678, 3816, 37981, 20852, 9391, 34120, 25468, 28900, 48194, 15836, 30361, 42460, 19943, 40113, 42346, 15021, 40058, 1185, 39303, 22093, 31498, 11348, 1863, 41028, 13400, 35249, 21810, 1862, 38572, 1573, 21360, 2640, 37475, 46528, 3780, 12250, 33354, 28228, 37165, 17090, 43750, 22216, 35420, 41431, 14630, 33838, 8814, 34766, 47103, 45788, 42311, 840, 42426, 17015, 22002, 8583, 31934, 32939, 24124, 28299, 13449, 45717, 47747, 47481, 29288, 40304, 16025, 18182, 33322, 26329, 28294, 19657, 21885, 12698, 10683, 42902, 24004, 30595, 42966, 20374, 25780, 44394, 19553, 41248, 39185, 25685, 1708, 16487, 20859, 3299, 1337, 28002, 22948, 12084, 9460, 36707, 45426, 12881, 31418, 22987, 39893, 34539, 17547, 41121, 24866, 369, 7264, 25686, 11889, 39153, 43122, 11264, 21672, 21624, 25614, 10646, 47247, 19110, 31658, 22024, 5414, 3306, 2136, 2225, 6816, 3859, 49869, 35703, 26619, 18971, 47891, 38906, 33532, 46687, 26490, 47767, 18546, 25373, 14163, 28162, 3557, 25342, 38762, 26785, 2877, 21931, 19243, 43416, 6035, 6187, 24717, 48033, 14126, 9225, 22744, 48041, 50240, 11825, 47462, 14433, 2292, 24475, 24361, 49107, 12191, 45544, 34139, 27053, 16208, 32605, 1606, 25718, 43910, 40904, 50528, 26706, 47709, 3318, 16607, 29589, 21680, 19186, 47289, 46388, 35574, 45044, 10838, 21948, 44636, 32372, 44598, 37986, 14056, 44243, 34360, 28085, 30433, 38831, 3464]

JL Peyret
  • 10,917
  • 2
  • 54
  • 73
1

Here is an alternative approach:

g = np.random.Generator(np.random.PCG64())
def random_int_new(original_list, num_to_draw):
    if not isinstance(original_list, np.ndarray):
      original_list = np.array(original_list)

    mask = np.zeros_like(original_list, dtype=bool)
    mask[g.choice(original_list.size, num_to_draw, replace=False)] = True
    drawn_numbers  = original_list[mask]
    left_over_list = original_list[~mask]

    return drawn_numbers, left_over_list

The main bottleneck of your code came from the list comprehension line, as you are testing for list membership 50800 times (This might also cause bugs if input list has repeated numbers). A faster approach is to create a boolean mask that indicated which elements are drawn.

In addition to that, numpy's default choice without replacement is known to be slow, and you can get better performance by calling choice on random number generator (the effect on performance is much less dramatic than the first one).

Finally, you could also shuffle the list first, then split it into chunks of 400.

hilberts_drinking_problem
  • 11,322
  • 3
  • 22
  • 51
1

Well, the main culprit is this line:

left_over_list = [i for i in original_list if i not in list(drawn_numbers)]

It creates a list on each iteration, that is extremely resource consuming, especially at start (because then it creates that list 50 800 times). So instead of doing that for each iteration, do it once and then just reference it (even better is to create a set as they are optimized for checking membership):

x = set(drawn_numbers)
left_over_list = [i for i in original_list if i not in x]

For me this little change drastically boosted performance. Before this, it took like 9 seconds for one iteration, now it takes less than 2 seconds for the entire thing.

Matiiss
  • 5,970
  • 2
  • 12
  • 29
  • WoW it reduced total time from 3min 57s to 305ms. Didn't know `set` is optimized for checking membership. – HyeonPhil Youn Dec 30 '21 at 14:44
  • 1
    @HyeonPhilYoun the `set` likely doesn't add that much to the performance (but definitely an improvement), it is just that now it doesn't do the conversion on every single iteration, you might as well use `x = list(drawn_numbers)` and that would still very much boost the performance – Matiiss Dec 30 '21 at 14:46
  • Well I tried it, I got 2min 9s, which is definitely smaller than the original one I wrote, but I guess using `set` is much better than the `list` I guess..? – HyeonPhil Youn Dec 30 '21 at 14:53
  • @HyeonPhilYoun then apparently `set` does boost performance by a lot. – Matiiss Dec 30 '21 at 15:43