I would like to randomly iterate through a range. Each value will be visited only once and all values will eventually be visited. For example:

class Array
    def shuffle
        ret = dup
        j = length
        i = 0
        while j > 1
            # swap element i with a random element in i..(length - 1)
            r = i + rand(j)
            ret[i], ret[r] = ret[r], ret[i]
            i += 1
            j -= 1
        end
        ret
    end
end

(0..9).to_a.shuffle.each{|x| f(x)}

where f(x) is some function that operates on each value. A Fisher-Yates shuffle is used to efficiently provide random ordering.

My problem is that shuffle needs to operate on an array, which is not cool because I am working with astronomically large numbers. Ruby will quickly consume a large amount of RAM trying to create a monstrous array. Imagine replacing (0..9) with (0..99**99). This is also why the following code will not work:

tried = {} # store previous attempts
bigint = 99**99
bigint.times {
    x = rand(bigint)
    redo if tried[x]
    tried[x] = true
    f(x) # some function
}

This code is very naive and quickly runs out of memory as tried accumulates entries.

What sort of algorithm can accomplish what I am trying to do?

[Edit1]: Why do I want to do this? I'm trying to exhaust the search space of a hash algorithm for an N-length input string looking for partial collisions. Each number I generate is equivalent to a unique input string, entropy and all. Basically, I'm "counting" using a custom alphabet.

[Edit2]: This means that f(x) in the above examples is a method that generates a hash and compares it to a constant, target hash for partial collisions. I do not need to store the value of x after I call f(x) so memory should remain constant over time.

[Edit3/4/5/6]: Further clarification/fixes.

[Solution]: The following code is based on @bta's solution. For the sake of conciseness, next_prime is not shown here; a minimal version is sketched after the code. It produces acceptable randomness and only visits each number once. See the actual post for more details.

N = size_of_range
Q = ( 2 * N / (1 + Math.sqrt(5)) ).to_i.next_prime
START = rand(N)

x = START
nil until f( x = (x + Q) % N ) == START # assuming f(x) returns x
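
For reference, here is a minimal next_prime using Ruby's standard prime library (a sketch, not the code from the original post; for astronomically large Q you would want a probabilistic primality test instead):

require 'prime'

class Integer
    # step upward until the standard library says the number is prime
    def next_prime
        n = self + 1
        n += 1 until n.prime?
        n
    end
end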
void
    You obviously aren't storing the result of your function invocation, as that would also take up a lot of memory. So what exactly are you doing? Why do you need to do this in a random order? If you were just accumulating the values, order would likely be irrelevant. I'd like to know more if you want a solution. – Turtle Mar 17 '10 at 05:04
  • If you don't need the results back in an array, change the sample code `(0..9).sort_by{rand}.map{|x| f(x)}` to use `each` instead of `map`. That will make the question clearer. – Harish Shetty Mar 17 '10 at 21:56
    `sort_by rand` is also not correct; it will give biased results. See http://www.robweir.com/blog/2010/02/microsoft-random-browser-ballot.html (JavaScript, but same concept). – Matthew Flaschen Mar 18 '10 at 04:48
    As @Matthew Flaschen wrote, your attempt to randomize the order of the list is horribly broken and will return results which may look random, but which aren't. His link gives a good description of the problem. – Turtle Mar 18 '10 at 05:25
  • void, you missed the point. That link was what *not* to do. You can't sort by any random function (a shifted random function is no better). – Matthew Flaschen Mar 18 '10 at 05:56
  • Okay, I see what you're saying. I changed the example to use a Fisher-Yates shuffle. – void Mar 18 '10 at 15:35
  • Created an iterator out of this: http://gist.github.com/363914 – Colin Curtin Apr 12 '10 at 19:35

11 Answers


I just remembered a similar problem from a class I took years ago; that is, iterating (relatively) randomly through a set (completely exhausting it) given extremely tight memory constraints. If I'm remembering this correctly, our solution algorithm was something like this:

  1. Define the range to be from 0 to some number N
  2. Generate a random starting point x[0] inside N
  3. Generate an iterator Q less than N
  4. Generate successive points x[n] by adding Q to the previous point and wrapping around if needed. That is, x[n+1] = (x[n] + Q) % N
  5. Repeat until you generate a new point equal to the starting point.

The trick is to find an iterator that will let you traverse the entire range without generating the same value twice. If I'm remembering correctly, any relatively prime N and Q will work (the closer Q is to the bounds of the range, the less 'random' the sequence looks). In that case, a prime number that is not a factor of N should work. You can also swap bytes/nibbles in the resulting number to change the pattern with which the generated points "jump around" in N.

This algorithm only requires the starting point (x[0]), the current point (x[n]), the iterator value (Q), and the range limit (N) to be stored.
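
In Ruby, a minimal sketch of this scheme might look like the following (the names are illustrative, not part of the original answer):

def each_random(n, q)
  start = rand(n)          # random starting point x[0]
  x = start
  loop do
    x = (x + q) % n        # x[n+1] = (x[n] + Q) % N
    yield x
    break if x == start    # full cycle completed
  end
end

each_random(10, 7) { |v| print v, " " }  # visits each of 0..9 exactly once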

Perhaps someone else remembers this algorithm and can verify if I'm remembering it correctly?

bta
    I think this is as good as you can get if you won't store the tried inputs and can't have duplicates. There is really no need for a truly random shuffle if you are going to test all inputs and they don't interfere. To spread the choices out as much as possible, use a Q close to the golden section (2N/(1+sqrt(5))). – mckeed Mar 18 '10 at 21:37
  • This sounds almost exactly like what I want to do. I am not overly concerned about randomness, but it is very important. If anyone knows the name of this algorithm, that would be great. – void Mar 19 '10 at 04:00
  • I'm not sure if there's a name for the algorithm. The specific principle it's based off of (a mathematical property of prime numbers with respect to modular arithmetic) might have a name though. – bta Mar 19 '10 at 19:31
    See http://en.wikipedia.org/wiki/Full_cycle (and perhaps http://en.wikipedia.org/wiki/Linear_congruential_generator) – Lars Haugseth May 08 '12 at 21:48

As @Turtle answered, your problem doesn't have a solution. The @KandadaBoggu and @bta solutions give you random numbers within ranges that may or may not look random; you get clusters of numbers.

But I don't know why you care about a double occurrence of the same number. If (0..99**99) is your range, then even if you could generate 10^10 random numbers per second (say, a 3 GHz processor with 4 cores generating one random number per CPU cycle, which is impossible, and Ruby would slow it down a lot), it would take about 10^180 years to exhaust all the numbers. You also have a probability of about 10^-180 that two identical numbers are generated during a whole year. Our universe is probably about 10^10 years old, so even if your computer had been calculating since time began, the probability that two identical numbers were ever generated is only about 10^-170. In other words: practically, it is impossible, and you don't have to care about it.

Even if you used Jaguar (number 1 on the www.top500.org list of supercomputers) for this one task alone, it would still take 10^174 years to get through all the numbers.
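
As a quick sanity check of the scale (my snippet; it assumes roughly 3.15 * 10^7 seconds per year):

years = 99**99 / (10**10 * 31_536_000)  # numbers per second * seconds per year
puts years.to_s.length                  # => 181, i.e. on the order of 10^180 years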

If you don't believe me, try:

tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  puts "Oh, no!" if tried[x]
  tried[x] = true
}

I'll buy you a beer if you will even once see "Oh, no!" on your screen during your life time :)

klew
  • Thanks for the useful information. The range (0..99**99) was just an example. The hashing algorithm I am testing against has a search space that is exhaustible in a realistic amount of time for realistic length inputs. I just wanted my algorithm to scale efficiently while giving each number the same probability of being selected. As for the beer, I think the sun has a higher probability of spontaneously teleporting to the other side of the galaxy :) – void Mar 18 '10 at 20:50
  • The search space I am testing is (0..(80**N-1)) for an input length of N. – void Mar 19 '10 at 03:41
For N = 11 it would take 34 years to exhaust all the numbers at the speed from my example above. So if you are using Ruby and not just generating numbers but also doing some calculations with them, you probably shouldn't care about repeated numbers, because it would take ages to exhaust all possibilities. On the other hand, for N = 6 you can store each tried number as a single bit in an array; that would take about 409 MB. With N = 7 you would need about 32 GB of memory, so you would probably have to store it on disk. But again, it would take a lot of time. – klew Mar 19 '10 at 10:25
On my computer a simple loop like this: `a = 80**4; b = 0; a.times {b = b+1}` took about 16 seconds. That means each time you increase N by one, the time increases 80-fold, so for N=6 it would take 24 minutes, for N=7, 28 hours, and for N=8, more than 9 days. By this calculation it comes to 13300 years for N=11 (a real example on a single 2.13 GHz core). – klew Mar 19 '10 at 10:29
It looks like you messed up your math. Going from `N=7` to `N=8` you multiplied by 8 instead of 80. The actual time for `N=8` is slightly more than 3 months. Given sufficient randomness in selecting a key to test, the average case time is cut in half. Taking advantage of a multicore CPU will divide the average case time by the number of cores you have. If more efficiency is needed, I could switch to a different language. Taking it to the next level, I could use my GPU for stream processing. – void Mar 19 '10 at 18:33

You can randomly iterate over an array with the shuffle method:

a = [1,2,3,4,5,6,7,8,9]
a.shuffle!
=> [5, 2, 8, 7, 3, 1, 6, 4, 9]
shweta

You want what's called a "full cycle iterator"...

Here is pseudocode for the simplest version, which is perfect for most uses:

function fullCycleStep(sample_size, last_value, random_seed = 31337, prime_number = 32452843) {
    if last_value = null then last_value = random_seed % sample_size
    return (last_value + prime_number) % sample_size
}

If you call this like so:

sample = 10
For i = 1 to sample
    last_value = fullCycleStep(sample, last_value)
    print last_value
next

It would generate random numbers, looping through all 10 and never repeating. If you change random_seed, which can be anything, or prime_number, which must be greater than and not evenly divisible by sample_size, you will get a new random order, but you will still never get a duplicate.
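
A direct Ruby translation of the pseudocode above (a sketch under the same assumptions):

def full_cycle_step(sample_size, last_value, random_seed = 31337, prime_number = 32452843)
  last_value ||= random_seed % sample_size
  (last_value + prime_number) % sample_size
end

last_value = nil
10.times { puts last_value = full_cycle_step(10, last_value) }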

Nick Steele

I could be wrong, but I don't think this is doable without storing at least some state.

Even if you use only one bit per value (has this value been tried: yes or no), you will need X/8 bytes of memory to store the results (where X is the largest number). Assuming you have 2 GB of free memory, that is enough to track about 16 billion numbers.
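
For illustration (my sketch, not the answer's code), that one-bit-per-value bookkeeping can be done in Ruby with a single bignum used as a bit set:

tried = 0             # one Integer acting as a giant bit vector
x = rand(2**34)       # 2^34 one-bit flags correspond to the 2 GB above
seen = tried[x] == 1  # Integer#[] reads the bit at position x
tried |= 1 << x       # set bit x to mark the value as tried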

Turtle

Break the range into manageable batches as shown below:

def range_walker(range, batch_size = 100)
  size = (range.end - range.begin) + 1
  n = size / batch_size
  n.times do |i|
    x = i * batch_size + range.begin
    y = x + batch_size
    (x...y).sort_by { rand }.each { |z| p z }
  end
  # handle whatever is left over after the full batches
  d = range.end - size % batch_size + 1
  (d..range.end).sort_by { rand }.each { |z| p z }
end

You can further randomize the solution by processing the batches themselves in random order, as sketched below.
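
For example (a hypothetical variation, not part of the original answer):

def random_batch_walker(range, batch_size = 100)
  size = (range.end - range.begin) + 1
  batches = (size + batch_size - 1) / batch_size  # round up
  (0...batches).to_a.shuffle.each do |i|
    x = range.begin + i * batch_size
    y = [x + batch_size - 1, range.end].min
    (x..y).to_a.shuffle.each { |z| p z }
  end
end

Note that for astronomically large ranges the array of batch indices is itself too big to hold in memory, so this only pushes the problem up one level.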

PS: This is a good problem for map-reduce. Each batch can be worked by independent nodes.

Reference:

Map-reduce in Ruby

Harish Shetty
  • Even if "n" and "batch_size" were the same number (sqrt(n)), the arrays generated would be too large to store in memory. Nice approach though. I think the final algorithm must do something similar to this, except the arrays would be of a manageable size. – void Mar 17 '10 at 15:52
  • In your question it wasn't clear that you wanted the results as an array. I thought you just wanted to randomly process numbers in a range ensuring that every number is processed. This solution does that regardless of the range size. If you want to return these numbers as an array then you have a different problem. – Harish Shetty Mar 17 '10 at 16:41
  • I'm sorry for not clarifying. I don't want the results as an array. Somewhere inside that loop I would like to call a method that takes the generated random number as input. Memory usage should remain constant in the long term. – void Mar 17 '10 at 17:02
  • Try calling range_walker(0..99**99) and you'll see what I mean. – void Mar 17 '10 at 17:12
  • I have fixed the problem. Try again. Memory consumption will remain the same. CPU nears 60% due to continuous processing. – Harish Shetty Mar 17 '10 at 18:20
  • Am I understanding this code correctly? The range is divided into batches. Each batch has a random distribution. However, the batches are still visited in order when they need to be visited randomly. Now we're back to the same problem. :-) – void Mar 18 '10 at 04:44
  • @void: It's a trade-off between random-ness and memory usage. You save quite a bit of memory by visiting the batches in order. Pretty much any solution is going to sacrifice random-ness for the sake of memory usage as long as there is a restriction that every input is visited exactly once. – bta Mar 18 '10 at 20:35
  • @void: Another way of looking at this: the batches aren't visited in order, they are visited in parallel. Use a multi-processor, multi-core machine and load up a batch on each core. This type of problem seems to be extremely parallelizable, and this solution seems to break it up into parallel chunks. – bta Mar 18 '10 at 21:15
  • I agree. Parallelization is highly effective for this situation. I just wanted the algorithm to scale nicely to extremely large ranges without using much memory, so I gave a ridiculous example. – void Mar 19 '10 at 03:47

Database systems and other large-scale systems do this by writing the intermediate results of recursive sorts to a temp database file. That way, they can sort massive numbers of records while only keeping limited numbers of records in memory at any one time. This tends to be complicated in practice.
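
As a minimal sketch of the temp-file idea (my illustration; it assumes the range is small enough to enumerate at all, which the question's is not):

require 'tempfile'

def disk_shuffle(range, buckets: 16)
  files = Array.new(buckets) { Tempfile.new('bucket') }
  range.each { |x| files[rand(buckets)].puts(x) }  # scatter values at random
  files.shuffle.each do |f|
    f.rewind
    f.readlines.map(&:to_i).shuffle.each { |x| yield x }  # shuffle one bucket in RAM
    f.close!
  end
end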

yfeldblum

How "random" does your order have to be? If you don't need a specific input distribution, you could try a recursive scheme like this to minimize memory usage:

def gen_random_indices
  # Assume your input range is (0...(10**3)), i.e. all three-digit indices
  (0..9).sort_by{rand}.each do |a|
    (0..9).sort_by{rand}.each do |b|
      (0..9).sort_by{rand}.each do |c|
        yield "#{a}#{b}#{c}".to_i
      end
    end
  end
end

gen_random_indices do |idx|
  run_test_with_index(idx)
end

Essentially, you are constructing the index by randomly generating one digit at a time. In the worst-case scenario, this requires enough memory to store 10 * (number of digits) values. You will encounter every number in the range (0...(10**3)) exactly once, but the order is only pseudo-random. That is, if the first loop sets a=1, then you will encounter all three-digit numbers of the form 1xx before you see the hundreds digit change.

The other downside is the need to manually construct the function to a specified depth. In your (0..(99**99)) case, this would likely be a problem (although I suppose you could write a script to generate the code for you). I'm sure there's probably a way to re-write this in a stateful, recursive manner, but I can't think of it off the top of my head (ideas, anyone?). One possible generalization is sketched below.
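
One possible generalization (my sketch; it uses shuffle rather than the biased sort_by{rand} idiom discussed in the question's comments):

def gen_random_indices(digits, base = 10, prefix = 0, &block)
  if digits.zero?
    block.call(prefix)
  else
    (0...base).to_a.shuffle.each do |d|
      gen_random_indices(digits - 1, base, prefix * base + d, &block)
    end
  end
end

gen_random_indices(3) { |idx| run_test_with_index(idx) }  # covers 0..999 once each

Memory usage stays at roughly base * digits integers regardless of the range size.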

bta
  • As random as it can be. This is so it can efficiently exhaust the search space. It is also what makes a birthday attack possible, dramatically cutting search time. Think of it as brute-forcing the combination to a lock. – void Mar 18 '10 at 04:57

[Edit]: Taking into account @klew and @Turtle's answers, the best I can hope for is batches of random (or close to random) numbers.


This is a recursive implementation of something similar to KandadaBoggu's solution. Basically, the search space (as a range) is partitioned into an array containing N equal-sized ranges. Each range is fed back in a random order as a new search space. This continues until the size of the range hits a lower bound. At this point the range is small enough to be converted into an array, shuffled, and checked.

Even though it is recursive, I haven't blown the stack yet. Instead, it errors out when attempting to partition a search space larger than about 10^19 keys. It has to do with the numbers being too large to convert to a long. It can probably be fixed:

# partition a range into an array of N equal-sized ranges
def partition(range, n)
    ranges = []
    first = range.first
    last = range.last
    length = last - first + 1
    step = length / n # integer division
    ((first + step - 1)..last).step(step) { |i|
        ranges << (first..i)
        first = i + 1
    }
    # append any extra onto the last element
    ranges[-1] = (ranges[-1].first)..last if ranges[-1].last < last
    ranges
end
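
For example:

p partition(0..10, 3)  # => [0..2, 3..5, 6..10]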

I hope the code comments help shed some light on my original question.

pastebin: full source

Note: PW_LEN under # options can be changed to a lower number in order to get quicker results.

void
  • It's nice, but you see how it's not a real shuffle, right? The first number will be randomly distributed, but then the next BLOCK_SIZE numbers will all be from the same range. – mckeed Mar 18 '10 at 21:09
  • Unless I'm misunderstanding your comment, Fisher-Yates is a real shuffle and it is used in the correct way. Each block is partitioned and visited in a random order. However, the best it can do is batches of random numbers... – void Mar 19 '10 at 04:14

For a prohibitively large space, like

space = -10..1000000000000000000000

You can add this method to Range.

class Range

  M127 = 170_141_183_460_469_231_731_687_303_715_884_105_727

  def each_random(seed = 0)
    return to_enum(__method__) { size } unless block_given?
    unless first.kind_of? Integer
      raise TypeError, "can't randomly iterate from #{first.class}"
    end

    sample_size = self.end - first + 1
    sample_size -= 1 if exclude_end?
    j = coprime sample_size
    v = seed % sample_size
    each do
      v = (v + j) % sample_size
      yield first + v
    end
  end

protected

  def gcd(a,b)
    b == 0 ? a : gcd(b, a % b)
  end

  def coprime(a, z = M127)
    gcd(a, z) == 1 ? z : coprime(a, z + 1)
  end

end

You could then

space.each_random { |i| puts i }

729815750697818944176
459631501395637888351
189447252093456832526
919263002791275776712
649078753489094720887
378894504186913665062
108710254884732609237
838526005582551553423
568341756280370497598
298157506978189441773
27973257676008385948
757789008373827330134
487604759071646274309
217420509769465218484
947236260467284162670
677052011165103106845
406867761862922051020
136683512560740995195
866499263258559939381
596315013956378883556
326130764654197827731
55946515352016771906
785762266049835716092
515578016747654660267
...

With a good amount of randomness, so long as your space is a few orders of magnitude smaller than M127.

Credit to @nick-steele and @bta for the approach.

captainpete

This isn't really a Ruby-specific answer but I hope it's permitted. Andrew Kensler gives a C++ "permute()" function that does exactly this in his "Correlated Multi-Jittered Sampling" report.

As I understand it, the exact function he provides really only works if your "array" is up to size 2^27, but the general idea could be used for arrays of any size.

I'll do my best to sort of explain it. The first part is that you need a hash that is reversible "for any power-of-two sized domain". Consider x = i + 1. No matter what x is, even if your integer overflows, you can determine what i was. More specifically, you can always determine the bottom n bits of i from the bottom n bits of x. Addition is a reversible hash operation, as is multiplication by an odd number, as is a bitwise xor by a constant. If you know a specific power-of-two domain, you can scramble bits in that domain. E.g. x ^= (x & 0xFF) >> 5 is valid for the 16-bit domain. You can specify that domain with a mask, e.g. mask = 0xFF, and your hash function becomes x = hash(i, mask). Of course you can add a "seed" value into that hash function to get different randomizations. Kensler lays out more valid operations in the paper.

So you have a reversible function x = hash(i, mask, seed). The problem is that if you hash your index, you might end up with a value that is larger than your array size, i.e. your "domain". You can't just modulo this or you'll get collisions.

The reversible hash is the key to using a technique called "cycle walking", introduced in "Ciphers with Arbitrary Finite Domains". Because the hash is reversible (i.e. 1-to-1), you can just repeatedly apply the same hash until your hashed value is smaller than your array! Because you're applying the same hash, and the mapping is one-to-one, whatever value you end up on will map back to exactly one index, so you don't have collisions. So your function could look something like this for 32-bit integers (pseudocode):

fun permute(i, length, seed) {
  i = hash(i, 0xFFFF, seed)
  while(i >= length): i = hash(i, 0xFFFF, seed)
  return i
}

It could take a lot of hashes to land in your domain, so Kensler does a simple trick: by masking out the unnecessary bits, he keeps the hash within the domain of the next power of two, which means very few iterations are required (about two on average). The final algorithm looks like this:

fun next_pow_2(length) {
  # This implementation is for clarity.
  # See Kensler's paper for one way to do it fast.
  p = 1
  while (p < length): p *= 2
  return p
}

fun permute(i, length, seed) {
  mask = next_pow_2(length)-1
  i = hash(i, mask, seed) & mask
  while(i >= length): i = hash(i, mask, seed) & mask
  return i
}

And that's it! Obviously the important thing here is choosing a good hash function, which Kensler provides in the paper but I wanted to break down the explanation. If you want to have different random permutations each time, you can add a "seed" value to the permute function which then gets passed to the hash function.
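
To make the idea concrete, here is a toy Ruby version (my constants and operations, not Kensler's exact hash):

def hash_in_domain(i, mask, seed)
  i = (i + seed) & mask        # addition: reversible mod 2^n
  i = (i * 0x9E3779B1) & mask  # multiplication by an odd constant: reversible
  i ^ (i >> 5)                 # xor with shifted self: reversible, stays <= mask
end

def permute(i, length, seed)
  mask = 1
  mask *= 2 while mask < length  # next power of two above length...
  mask -= 1                      # ...minus one, used as the bit mask
  i = hash_in_domain(i, mask, seed)
  i = hash_in_domain(i, mask, seed) while i >= length  # cycle-walk into range
  i
end

perm = (0...10).map { |i| permute(i, 10, 42) }
p perm.sort  # => [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -- a permutation, no repeats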

Andrew