3

Similar question to this: Combination of List<List<int>>

However, the shuffling makes this non-trivial and IMO requires a much different solution. I'm happy to be proved wrong, but it's not as easy as shuffling the results since the resulting list can't fit into memory:

I have 3 LARGE lists I need to combine:

List<int> A = new List<int> {1, 2, 3, ...};
List<int> B = new List<int> {4, 5, 6, ...};
List<int> C = new List<int> {7, 8, 9, ...};

The output is a new List<object> { {1, 4, 7}, {1, 4, 8}, ... }

However, I the resulting list needs to be shuffled. Shuffling each individual list will not be sufficient if they are still patterned when combined, and while the 3 lists will fit in memory, the combination of all 3 will not. A shuffled index list is also apparently too large to store in memory.

I have tried a number of different approaches, but I cannot find a way to randomize the order without loading every item first. Is this possible? Thanks!

Community
  • 1
  • 1
Adam
  • 1,984
  • 2
  • 16
  • 19
  • 1
    What type is `object`? – Tim Schmelter Nov 27 '13 at 12:12
  • @TimSchmelter essentially a struct with all 3 ints – Adam Nov 27 '13 at 12:14
  • Do you need to take all numbers out of each list without repetition or does only the total count be `A.Count+B.Count+C.Count` and repetition is allowed? – Tim Schmelter Nov 27 '13 at 12:19
  • @TimSchmelter The total count should be `A.Count*B.Count*C.Count`. I've been working around this problem by ignoring repetitions and then checking on the processing end if something has already been tested. This has worked OK, but also means I have no idea when I've completed every possible combination and no progress while doing so - basically an long-running infinite loop. – Adam Nov 27 '13 at 12:26
  • Do all lists have the same size(as i have presumed in my answer below)? I also don't understand why it is `A.Count*B.Count*C.Count`. – Tim Schmelter Nov 27 '13 at 12:35
  • @TimSchmelter Thanks for the answer. The reason its `A.Count*B.Count*C.Count` is because I need every combination possible: `{1,4,7},{1,4,8},...,{1,5,7},{1,5,8},...,{2,4,7}...` such that in each property there are n possible numbers, or n^3 combinations (if n=A.Count=B.Count=C.Count) but they might not be the same size either. – Adam Nov 27 '13 at 12:42
  • You say it's "quite like" [Combination of List>](http://stackoverflow.com/questions/545703/combination-of-listlistint) but the first answer is pretty much exactly what you want, possibly followed by shuffling the result. – Rawling Nov 27 '13 at 12:50
  • @Rawling The shuffling makes this non-trivial and IMO requires a much different solution. I'm happy to be proved wrong, but it's not as easy as shuffling the results since the resulting list can't fit into memory – Adam Nov 27 '13 at 13:04
  • My apologies, Adam, I didn't manage to fit "needs shuffling" and "can't fit in memory" into my head at the same time! – Rawling Nov 27 '13 at 13:48
  • Can I ask, why is it you need to shuffle this huge Cartesian product? Unless you're consuming a significant portion of the result somehow, would it be sufficient to generate random triples on demand and just keep track of which have already been generated? – Rawling Nov 27 '13 at 14:52
  • @Rawling I am currently generating random triples and keeping track of which ones have been processed. It works ok but there's no way to ensure the full list will be run. It will also theoretically get much slower towards the end as it has to randomly find the last few triples that haven't been run. I was just hoping there was a better way – Adam Nov 27 '13 at 15:02
  • I was just thinking about a solution based on enumerator and random, keeping track of the already processed triples. Will not post it then ;-) As you pointed out the performance will be horrible as you come to the end of the list. The space complexity might be a little better as you will be able to just store a hashcode for the already processed triples. – jbl Nov 27 '13 at 15:35
  • 1
    Well, as long as you have preprocessing time to spare, store a file containing `int`s `0 .. A.Count * B.Count * C.Count - 1` to disk and then perform a Fisher-Yates shuffle on that. It's slow as hell (lots of random disk access) but there's no shortcutting a proper shuffle. – Rawling Nov 27 '13 at 17:07

3 Answers3

0

If all lists have the same size:

var aShuffle = new List<int>(A.Count);
aShuffle.AddRange(A.Shuffle());
var bShuffle = new List<int>(B.Count);
bShuffle.AddRange(B.Shuffle());
var cShuffle = new List<int>(C.Count);
cShuffle.AddRange(C.Shuffle());

List<Trio> trios = new List<Trio>(aShuffle.Count);
for (int i = 0; i < aShuffle.Count; i++)
{
    trios.Add(new Trio { Value1 = aShuffle[i], Value2 = bShuffle[i], Value3 = cShuffle[i] });
}

With this struct:

public struct Trio
{
    public int Value1 { get; set; }
    public int Value2 { get; set; }
    public int Value3 { get; set; }
}

I've used this extension to shuffle (Fisher-Yates) the collection:

public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source)
{
    return source.Shuffle(new Random());
}

public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random rng)
{
    if (source == null) throw new ArgumentNullException("source");
    if (rng == null) throw new ArgumentNullException("rng");

    return source.ShuffleIterator(rng);
}

private static IEnumerable<T> ShuffleIterator<T>(
    this IEnumerable<T> source, Random rng)
{
    List<T> buffer = source.ToList();
    for (int i = 0; i < buffer.Count; i++)
    {
        int j = rng.Next(i, buffer.Count);
        yield return buffer[j];

        buffer[j] = buffer[i];
    }
}

Demonstration

Value1: 3 Value2: 5 Value3: 8
Value1: 1 Value2: 6 Value3: 9
Value1: 6 Value2: 4 Value3: 7
Value1: 2 Value2: 8 Value3: 33
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
0

Have you tried to make use of the Union LINQ-Extension?

You can use it to unify your lists and then shuffle the result :)

krizz
  • 412
  • 3
  • 5
0

To expand upon my comment...

Firstly, be sure that you can't do this shuffle in physical memory. If you have fewer than UInt32.MaxValue items you can still use managed arrays. Plus you only need to store an int (or possibly long) per item, not the item itself.

So, you have so many items you can't count them in memory?

Start off by storing a number for each item in a file.

using (var f = File.Create("shuffle.bin"))
{
    var length = A.Count * B.Count * C.Count;

    f.SetLength(length * sizeof(long));
    for (long i = 0; i < length; i++)
    {
        var bytes = BitConverter.GetBytes(i);
        f.Write(bytes, 0, sizeof(long));
    }
}

It would take forever to shuffle this file. So instead, we can shuffle one entry out at a time, on demand. This shouldn't take too long to extract one number.

using (var f = File.Open("shuffle.bin", FileMode.Open))
{
    // Calculate how many numbers are left in the file.
    long itemCount = f.Length / sizeof(long);
    if (itemCount == 0)
    {
        // We have used all the items - create another file, or throw
        // an exception, or whatever you want.
    }

    // You need an equivalent of `Random.Next(int max)` here that works on `long`s.
    long index = NextLong(itemCount);

    // Read out the number you've shuffled out.
    f.Seek(index * sizeof(long), SeekOrigin.Begin);
    var rtnBytes = new byte[sizeof(long)];
    f.Read(rtnBytes, 0, sizeof(long));

    // Read out the last number.
    f.Seek((itemCount - 1) * sizeof(long), SeekOrigin.Begin);
    var rplcBytes = new byte[sizeof(long)];
    f.Read(rplcBytes, 0, sizeof(long));

    // Replace the shuffled-out number with the last number.
    f.Seek(index * sizeof(long), SeekOrigin.Begin);
    f.Write(rplcBytes, 0, sizeof(long));

    // Trim the now-duplicate last number off the end of the file.
    f.SetLength((itemCount - 1) * sizeof(long));

    // Get the long and do with it what you want.
    long rtnLong = BitConverter.ToInt64(rtnBytes, 0);
}

You can then turn this long into a triple of indices as follows:

int indexA = (int)(rtnLong % (long)A.Count);
rtnLong /= a.Count;
int indexB = (int)(rtnLong % (long)B.Count);
rtnLong /= b.Count;
int indexC = (int)rtnLong;
Rawling
  • 49,248
  • 7
  • 89
  • 127