
I need to iterate over a list an arbitrary number of times, yielding each element of the list in random order (a different order each time I iterate through the full list). I need to yield each element once before yielding that element a second time, yield each element twice before yielding that element a third time, etc.

Currently, my code looks like this:

import random

def random_yield(data):
    random.shuffle(data)
    data_index = 0
    while True:
        yield data[data_index]
        data_index += 1

        if data_index == len(data):
            random.shuffle(data)
            data_index = 0

Is there a way to do this more efficiently so I don't pay the performance penalty of the `random.shuffle()` after every `len(data)` yields?

Thomas Johnson
  • "I need to yield each element once before yielding that element a second time.." What? If you do something the first time, the second time you do it you have done it twice, just for the record. – nbro Mar 09 '17 at 23:19
  • I'm looping over the list an arbitrary number of times. So if the list is [1,2,3] then I could yield `1, 2, 3, 2, 3, 1, 3, 2` but not `1, 1, 1, 3, 2...` This unfortunately prohibits the easy solution of just randomly picking an index and yielding `data[random_idx]` – Thomas Johnson Mar 09 '17 at 23:20
  • This is sort of an aside, but your generator implementation could be greatly simplified (no need to keep track of `data_index`) – juanpa.arrivillaga Mar 09 '17 at 23:23
  • 1
    The random shuffle involves exactly `n` iterations of "get a random number and swap two elements", and you do it every `n` yields, so on average it happens once per yield. Your "easy solution" involves getting a random number once per yield. Why do you feel that one of those is very efficient and the other one terribly slow? – rici Mar 09 '17 at 23:26
  • Please bear with me: what is the exact problem here? Your code looks fine, does what you want, and has perfect complexity on average. What do you want to achieve? Do you want n-times O(1) instead of paying the price of shuffling at some point? (some kind of real-time constraint) – sascha Mar 09 '17 at 23:26
  • @sascha Yes, making each call take approximately the same time is what I'm trying to achieve. This code gets run in parallel (really parallel, not GIL-constricted parallel) to another process that consumes the results. During that `random.shuffle()` time, the consuming process starves – Thomas Johnson Mar 09 '17 at 23:29
  • See this [gist](https://gist.github.com/juanarrivillaga/f67e5fc9b601bfadba79f0d25887b3c3) for a simpler implementation (sketched after these comments), although it doesn't address your issue. I don't think you will be able to avoid the equivalent of a shuffle operation every time you go through `len(data)` items. – juanpa.arrivillaga Mar 09 '17 at 23:30
  • 1
    Does [this help](http://stackoverflow.com/questions/196017/unique-non-repeating-random-numbers-in-o1#196065)? – sascha Mar 09 '17 at 23:30
  • @sascha Yeah that looks great actually. – Thomas Johnson Mar 09 '17 at 23:32
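
For reference, here is a minimal sketch of the simplification juanpa.arrivillaga describes (the linked gist is not reproduced here, so treat this as an assumption about its content): reshuffle and walk the whole list on each cycle, with no explicit index bookkeeping.

import random

def random_yield(data):
    # Shuffle, then walk the whole list; repeat forever.
    # No index tracking needed -- "yield from" exhausts each pass.
    while True:
        random.shuffle(data)
        yield from data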

1 Answer


You can do a Fisher-Yates shuffle one step at a time, distributing the cost of the shuffle evenly across the iterations. That's not more efficient overall -- in fact, it is probably slower, since the library's shuffle is likely faster than hand-rolled Python -- but it avoids the long pauses.

The code is not significantly different from just grabbing a random element each time. The only difference is that you grab each random element from the not-yet-used portion of the list:

from random import randrange

def random_yield(data):
    index = 0
    limit = len(data)
    while True:
        if index + 1 >= limit:
            # Only one unused element left this cycle; yield it and start over
            yield data[index]
            index = 0
        else:
            # Get a random element which we haven't yet used this cycle
            # (this is a single iteration of the Fisher-Yates shuffle algorithm)
            j = randrange(index, limit)
            rv = data[j]
            yield rv
            # Swap the element we just selected so it's not in the next subrange
            data[j] = data[index]
            data[index] = rv
            index += 1
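
For example (a usage sketch, not part of the original answer; the sample output is illustrative):

from itertools import islice

data = [1, 2, 3]
print(list(islice(random_yield(data), 8)))
# e.g. [2, 1, 3, 3, 1, 2, 1, 3] -- every block of len(data) yields
# is a fresh permutation, so no element repeats within a block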
rici
  • Same idea, but without the special-casing of the last element: https://gist.github.com/paulhankin/30ffd466055c18faa957b316b3d3bce9 (sketched after these comments). I think it's easier to understand. – Paul Hankin Mar 10 '17 at 00:14
  • 1
    Let me put that a different way. I find it easier to understand without the modulo because for me it is evident that what it is doing is choosing a random element from the set of elements not yet chosen. For me, faced with the modulo operator, I have to deduce what I just said; it is no longer *evident* (i.e. visible, which is the root of evidence). But that's just me. I could be wildly non-representative, but nonetheless I code according to my own intuitions. Luckily, it's a wide world and we all have our place in the choir. – rici Mar 10 '17 at 00:31
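
For reference, a minimal sketch of the modulo-based variant these comments describe (the linked gist is not reproduced here, so the details are an assumption):

from random import randrange

def random_yield(data):
    n = len(data)
    i = 0  # total yields so far; i % n is the position within the current cycle
    while True:
        r = i % n
        # Pick from the not-yet-used slice data[r:]. When r == n - 1,
        # randrange(r, n) can only return r, so the last element of a
        # cycle needs no special case.
        j = randrange(r, n)
        data[r], data[j] = data[j], data[r]
        yield data[r]
        i += 1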