1

I have a huge linked list of integers (let's say it has size N, but N is unknown to me) and want to get k random values from it in minimum possible time/space.

I think it must be possible to write a variation of inside-out Fisher-Yates shuffle that will solve this problem in O(N) time and O(k) additional space.

Can anyone help me to get statistically correct solution with specified time/space bounds?

I think my current code is close to the correct solution:

public class Node
{
    public int Data;

    public Node Next;

    // O(N) time, O(k) additional space
    public int[] GetRandomData(int k)
    {
        var a = new int[k];
        var rand = new Random();

        int n = 0;
        Node cur = this;
        while (cur != null)
        {
            int r = rand.Next(0, n);

            if (r < k)
            {
                a[r] = cur.Data;
            }

            cur = cur.Next;          
        }

        if (n < k) throw new ArgumentException("k is bigger than N");        
        return a;
    }
}
ptkvsk
  • 2,096
  • 1
  • 25
  • 47
  • Why not just get k different random values between [0..N-1] ? – Shmoopy Jan 23 '14 at 13:09
  • possible duplicate of [Randomize a List in C#](http://stackoverflow.com/questions/273313/randomize-a-listt-in-c-sharp) – weston Jan 23 '14 at 13:10
  • This is a linked list, guys. I want to do it with only O(k) additional space. – ptkvsk Jan 23 '14 at 13:11
  • This is an interesting problem that shouldn't count as a duplicate - the suggested duplicate link requires random access for an efficient shuffle which isn't the case here. I'm working on some thoughts and will post an answer shortly if I can – Fish Jan 23 '14 at 13:16
  • 1
    The code you had looked close, just missed the "if the random slot is occupied, move its contents to the end of the list"-type step. – Rawling Jan 23 '14 at 13:20
  • 1
    Why notg enerate a list of `k` integers in the range of `[0,N-1]` sort it (simple). Then traverse the original list and for every value in the list with integers you print the item. Takes `O(N)` time and requires `O(K)` extra space – invalid_id Jan 23 '14 at 13:21
  • Rawling, this is the place I have doubts about. Can you explain to me why this step is needed? – ptkvsk Jan 23 '14 at 13:24
  • invalid_id, interesting approach! However, you need O(k log k) time to sort, but this shouldn't be a problem. Another problem might be that you need to generate k DISTINCT numbers in order to solve the problem correctly. – ptkvsk Jan 23 '14 at 13:25
  • @lonelyass Generating `k` distinct numbers can be done. The problem is that is should be really random so it can take additional some time (though expected time is still O(k)`. I wasn't sure about how big `k` is, if it is much smaller than `n` the `O(k log k)` time can be hidden in `O(n)`, that is if `O(k) <= O(log n)`. Furthermore, it's really simple :P – invalid_id Jan 24 '14 at 06:38

3 Answers3

2

This returns a uniformly distributed random sample of k items from a sequence of unknown length. The algorithm is called reservoir sampling.

def sample(seq, k):
    seq = iter(seq)
    result = [seq.next() for _ in xrange(k)]
    for i, s in enumerate(seq):
        r = random.randrange(i + k + 1)
        if r < k: result[r] = s
    return result
Paul Hankin
  • 54,811
  • 11
  • 92
  • 118
0

I ended up with this version (C#) and I'm pretty sure that it's correct. Tell me if I'm wrong.

public class Node
{
    public int Data;

    public Node Next;

    // O(N) time, O(k) additional space
    public int[] GetRandomData(int k)
    {
        var a = new int[k];
        var rand = new Random();

        a[0] = this.Data;
        int i = 1;

        for (Node cur = this.Next; cur != null; cur = cur.Next, i = i + 1)
        {
            int r = rand.Next(0, i + 1);

            if (r < k)
            {
                if (i < k)
                {
                    a[i] = a[r];
                }

                a[r] = cur.Data;
            }
        }

        if (i < k) throw new ArgumentException("k is bigger than N");
        return a;
    }
}

UPD: ok, so my code is identical to what's written here in wikipedia, so it must be statistically correct.

ptkvsk
  • 2,096
  • 1
  • 25
  • 47
-1

This is the implementation of sattolo algorithm

from random import randrange  
def sattoloCycle(items):
    i = len(items)
    while i > 1:
        i = i - 1
        j = randrange(i)  # 0 <= j <= i-1
        items[j], items[i] = items[i], items[j]
    return
Mike Minaev
  • 1,912
  • 4
  • 23
  • 33
  • This finds cycles and does not take in mind that I have linked list, so it's irrelevant to the question. – ptkvsk Jan 23 '14 at 13:29