6

I am studying CLRS and found a problem on shuffling algorithm. Does this produce a uniformly random permutations?

1 PERMUTE-WITH-ALL-IDENTITY(A)
2    n = A.length
3    for i = 1 to n
4        swap A[i] with A[RANDOM(1,n)]
5        swap A[i] with A[RANDOM(i+1,n)] 

My claim: No. it does not. Because, there will be n^n possible permutations due to line 4. And, it is not divisible by n! which is number of distinct permutations.

Can you, please, confirm if my reasoning is correct?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
Bek Abdik
  • 99
  • 4
  • 2
    But you're picking up `n!` possibilities because of line 5. – Teepeemm May 11 '15 at 14:07
  • 1
    @Teepeemm Then, the line 4 is simply ignored. Except, it makes sure that we can get element at the same place, which would not be possible without line 4. Am I right? – Bek Abdik May 11 '15 at 14:12
  • 2
    I suggest working through it when `n == 3`: see if you can figure out the probabilities of each of the six possible outcomes. (Hint: they're not equal.) – Mark Dickinson May 11 '15 at 14:14
  • This would depend largely on the qualities of your `RANDOM()` macro (function?). If it does not produce a uniform distribution, then the chances of producing a uniform probability over the set of possible permutations is likely significantly reduced... – twalberg May 11 '15 at 18:54
  • Just to ensure..is it A[RANDOM(1,n)] or A[RANDOM(i,n)] at line 4? – shole May 13 '15 at 08:11
  • @shole It is A[RANDOM(1,n)] – Bek Abdik May 13 '15 at 11:17
  • In case of i beeing n what does Rand(i + 1, n), in other word Rand(n + 1, n) even mean? – Hamid Alaei May 13 '15 at 11:53
  • Consider that this is pretty much the Fisher-Yates shuffle already except for line 5. I don't fully understand the significance of again swapping the ith index with a random index from i+1 to n, but I'm guessing it skews the permutation. – Justin Kaufman May 15 '15 at 01:00
  • @JustinKaufman: Without line 5 it's an often-implemented misunderstanding of the Fisher-Yates shuffle, and fundamentally flawed. See [Potential sources of bias](http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Potential_sources_of_bias). Or [The Danger of Naïveté](http://blog.codinghorror.com/the-danger-of-naivete/). – Jim Mischel Jun 08 '15 at 19:25

2 Answers2

3

For starters, I think there's a bug in the pseudocode. In line 4, I believe that there's a bounds error when i = n, since that would ask for a random number between n+1 and n. In what follows, I've corrected this by assuming the code is the following:

1 PERMUTE-WITH-ALL-IDENTITY(A)
2    n = A.length
3    for i = 1 to n
4        swap A[i] with A[RANDOM(1,n)]
5        swap A[i] with A[RANDOM(i,n)] 

If this is the code, then I believe the answer is no, this does not produce uniformly-random permutations. I don't have a mathematical proof that this is the case, but I do have a piece of C++ code that brute-forces all possible paths through PERMUTE-WITH-ALL-IDENTITY and counts the number of times that each permutation is produced. I ran that code and tested the algorithm on permutations on sequences of length 0 through 4, inclusive, and found that not all permutations are equally likely.

Here's the code:

#include <iostream>
#include <map>
#include <vector>
#include <algorithm>
using namespace std;

/* Maximum size of a sequence to permute. */
const size_t kMaxSize = 4;

/**
 * Given a frequencies map associating permutations to the number of times
 * that we've seen them, displays a visual report of the permutations
 * and their frequencies.
 *
 * @param frequencies The frequencies map.
 */
void reportResults(const map<vector<int>, size_t>& frequencies, size_t size) {
  cout << "Report for size " << size << endl;
  cout << "===================================================" << endl;  

  /* Print out the map. */
  cout << "  Map entries:" << endl;
  for (const auto& entry: frequencies) {
    cout << "    ";
    for (const auto& num: entry.first) {
      cout << num;
    }
    cout << ": " << entry.second << endl;
  }

  cout << "===================================================" << endl;
  cout << endl << endl;
}

/**
 * Traces through all possible executions of the algorithm, recording
 * the number of times that each outcome occurs. This algorithm uses
 * exhaustive recursion to explore the full space, assuming that the
 * underlying random generator is uniform.
 *
 * @param elems The elements in the sequence. It's assumed that initially
 *    these are in sorted order, but as the algorithm progresses it will
 *    become progressively more permuted.
 * @param frequencies A map from permutations to the number of times that
 *    they appear.
 * @param index The index through the main loop that we are currently in.
 *    This is the variable 'i' in the pseudocode.
 * @param size The length of the vector. This isn't strictly necessary,
 *    but it makes the code a bit cleaner.
 */
void recPopulate(const vector<int>& elems,
                 map<vector<int>, size_t>& frequencies,
                 size_t index, size_t size) {
  /* If we've finished permuting everything, record in the frequency map
   * that we've seen this permutation one more time.
   */
  if (index == size) {
    frequencies[elems]++;
  } else {
    /* For all possible pairs of a first swap and a second swap, try that
     * out and see what happens.
     */
    for (size_t i = 0; i < size; i++) {        // First swap index
      for (size_t j = index; j < size; j++) {  // Second swap index
        /* Make a clean copy of the vector to permute. */
        vector<int> newElems = elems;

        /* Perform the swaps. */
        swap(newElems[i], newElems[index]);
        swap(newElems[j], newElems[index]);

        /* Recursively explore all possible executions of the algorithm
         * from this point forward.
         */
        recPopulate(newElems, frequencies, index + 1, size);
      }
    }
  }
}

/**
 * Traces through all possible executions of the proposed algorithm,
 * building a frequency map associating each permutation with the
 * number of possible executions that arrive there.
 *
 * @param size The number of elements in the initial sequence.
 * @return A frequency map from permutations to the number of times that
 *    permutation can be generated.
 */
map<vector<int>, size_t> populateFrequencies(size_t size) {
  /* Create the sequence 0, 1, 2, ..., size - 1 */
  vector<int> elems(size);
  iota(elems.begin(), elems.end(), 0);

  map<vector<int>, size_t> frequencies;
  recPopulate(elems, frequencies, 0, elems.size());
  return frequencies;
}

int main() {
  for (size_t size = 0; size <= kMaxSize; size++) {
    map<vector<int>, size_t> frequencies = populateFrequencies(size);
    reportResults(frequencies, size);
  }
}

This program generates reports that show, for each permutation, the number of possible execution paths through the algorithm that produce that permutation. The output for permutations of four elements is shown below. Given that each execution path is equally likely, since the numbers here aren't the same, we can conclude that the algorithm is not uniformly random:

Report for size 4
===================================================
  Map entries:
    0123: 214
    0132: 268
    0213: 267
    0231: 316
    0312: 242
    0321: 229
    1023: 268
    1032: 322
    1203: 312
    1230: 349
    1302: 287
    1320: 262
    2013: 242
    2031: 283
    2103: 233
    2130: 262
    2301: 252
    2310: 240
    3012: 213
    3021: 208
    3102: 204
    3120: 187
    3201: 248
    3210: 236
===================================================

The above analysis hinges on the fact that

  1. My code is correct, and
  2. My interpretation of the pseudocode is correct.

If either of these are wrong, I'd be happy to retract or edit this answer. Please let me know if I've made a mistake here!

Hope this helps!

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
1

The analysis above suggests a chi of 147 with 23 degrees of freedom, which means that the P-Value is < 0.00001. This indicates a very poor fit with a presumed uniform distribution.

But.

There only seems to be 6144 total samples. If you're looking at randomness, I would have thought that more runs would be appropriate. It could be that the P-Value moves towards a more favourable position after a 1000ish runs. Don't reseed the random generator between runs though.

Paul Uszak
  • 377
  • 4
  • 18
  • Which analysis are you referring to? – templatetypedef Jul 27 '15 at 11:35
  • Oh, that's not me running tests and counting the results. The code I included brute-force tries all possible executions of the algorithm - which are all equally probable due to how the code is structured - and therefore should be constructive proof that the algorithm isn't random rather than statistical evidence. – templatetypedef Jul 27 '15 at 19:52
  • Ah. Doesn't the random statement in swap A[i] with A[RANDOM(1,n)] suggest randomised swapping aka a shuffle algorithm? Wouldn't that require re-runs because you could get 1234 several times randomly for example? – Paul Uszak Jul 27 '15 at 20:13
  • The backtracking algorithm I wrote above essentially tries every single possibility whenever there's a choice to be made. Essentially, it's enumerating all possible outcomes of the random choice independently of one another and considering each option to see what can happen. – templatetypedef Jul 27 '15 at 22:48