8

I need to create a list with one billion integers and they must all be unique. I also need this to be done extremely fast.

Creating a list and adding random numbers one by one and checking to see if each one is a duplicate is extremely slow.

It seems to be quite fast if I just populate a list with random numbers without checking if they are duplicates and then using distinct().toList(). I repeat this until there are no more duplicates. However the extra memory used by creating a new list is not optimal. Is there a way to get the performance of distinct() but instead of creating a new list it just modifies the source list?

DP.
  • 83
  • 1
  • 1
  • 6
  • 2
    How large is the range in which you want to create the integers? – CodesInChaos Nov 02 '11 at 14:14
  • This was previously asked for the C language. The highest-voted answer (as opposed to the accepted answer) has some advice that can be ported to C#. http://stackoverflow.com/questions/1608181/unique-random-numbers-in-an-integer-array-in-the-c-programming-language – David Nov 02 '11 at 14:17
  • 1
    What datatype are you storing these ints in? A 32-bit Integer can only hold 2.1bn anyway (2,147,483,647), so there's a limit to how 'random' you can really be. – Widor Nov 02 '11 at 14:17
  • There is a similiar post here (it's a list of 300 unique numbers) http://stackoverflow.com/questions/5561742/generate-distinct-random-numbers-in-c-sharp – Carol Skelly Nov 02 '11 at 14:17
  • than just add Numbers from 1 to int.MaxValue ,that will be unique for sure . – Rosmarine Popcorn Nov 02 '11 at 14:19
  • from 1-1000000000. Perhaps all I need is an extremely fast and efficient shuffle algorithm because I cannot have them in order. – DP. Nov 02 '11 at 14:26
  • Do 1 billion ints even fit into a single `List`. I thought .net supports only 2GB arrays. – CodesInChaos Nov 02 '11 at 14:28
  • @CodeInChaos, memory is the only limit for array sizes in .NET. On most x86 Windows editions attempting to do this will result in an OutOfMemoryException but on x64 Windows editions it will work. – Darin Dimitrov Nov 02 '11 at 14:31
  • 1
    @Darin Did they add that feature in .net 4? In .net 2 the 2GB limit applied even to 64 bit applications. http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx – CodesInChaos Nov 02 '11 at 14:34
  • @CodeInChaos, you seem to be correct. I didn't know about this restriction. – Darin Dimitrov Nov 02 '11 at 14:37
  • 1
    You have not included nearly enough information in the question to give a solution. For example, do you require the list to be in a random order, or merely to contain random integers? ("Pick Six" style lottery tickets choose six random distinct integers; the order does not matter. But when choosing six random songs from a list and playing them, the order should be random as well.) What is the range of the integers? Can they be negative? Can they be larger than what fits in an Int32? An Int64? – Eric Lippert Nov 02 '11 at 16:51

7 Answers7

15

I found this the fastest while maintaining randomness:

        Random rand = new Random();
        var ints = Enumerable.Range(0, numOfInts)
                                     .Select(i => new Tuple<int, int>(rand.Next(numOfInts), i))
                                     .OrderBy(i => i.Item1)
                                     .Select(i => i.Item2);

...basically assigning a random id to each int and then sorting by that id and selecting the resulting list of ints.

Pierre
  • 1,607
  • 3
  • 21
  • 33
14

Do the integers need to be in a certain range? If so, you could create an array or list with all numbers in that range (for example from 1 to 1000000000) and shuffle that list.

CodeCaster
  • 147,647
  • 23
  • 218
  • 272
2

If the amount of possible integers from which you draw is significantly larger (say factor 2) than the amount of integers you want you can simply use a HashSet<T> to check for duplicates.

List<int> GetUniqueRandoms(Random random, int count)
{
  List<int> result = new List<int>(count);
  HashSet<int> set = new HashSet<int>(count);
  for(int i = 0; i < count; i++)
  {
    int num;

    do
    {
      num = random.NextInt();
    while(!set.Add(num));

    result.Add(num);
  }
  return result;
}

This allocates the collections with the correct capacity to avoid reallocation during growth. Since your collections are large this should be a large improvement.

You can also use Distinct a single time:

IEnumerable<int> RandomSequence(Random random)
{
    while(true)
    {
      yield return random.NextInt();
    }
}

RandomSequence(rand).Distinct().Take(1000000000).ToList();

But with both solutions you need enough memory for a HashSet<int> and a List<int>.


If the amount of possible integers from which you draw is about as large as the amount of integers you want, you can create an array that contains all of them, shuffle them and finally cut off those you're not interested in.

You can use Jon Skeet's shuffle implementation.

Community
  • 1
  • 1
CodesInChaos
  • 106,488
  • 23
  • 218
  • 262
2

You can track duplicates in a separate HashSet<int>:

var set = new HashSet<int>();
var nums = new List<int>();

while(nums.Count < 1000000000) {
    int num;
    do {
        num = rand.NextInt();
    } while (!set.Contains(num));
    set.Add(num);
    list.Add(num);
}

You need a separate List<int> to store the numbers because a hashset will not preserve your random ordering.

SLaks
  • 868,454
  • 176
  • 1,908
  • 1,964
  • 3
    This will be increasingly slower when the list gets more filled. It might literally take hours to find the last number that fits into the list. – CodeCaster Nov 02 '11 at 14:19
  • @CodeCaster: Yes. If shuffling is an option, it's a much better option. – SLaks Nov 02 '11 at 14:20
  • Having two objects containing all the ints is not an option. – DP. Nov 02 '11 at 14:41
  • 1
    The inner loop will run forever because _set_ will initially be empty and will never contain _num_. Should remove the "!" from the do-loop condition. – Ingvi Gautsson Nov 20 '19 at 13:11
2

Taking the question literally (a list with one billion integers and they must all be unique):

Enumerable<int>.Range(0, 1000000000)

But along the lines of CodeCaster's answer, you can create the list and shuffle it at the same time:

var count = 1000000000;
var list = new List<int>(count);
var random = new Random();
list.Add(0);
for (var i = 1; i < count; i++)
{
    var swap = random.Next(i - 1);
    list.Add(list[swap]);
    list[swap] = i;
}
James Clark
  • 1,765
  • 13
  • 17
0

You can trick LINQ to jumble the numbers for you by supplying a random number lambda to OrderBy:

Random rand = new Random();
var ints = Enumerable.Range(0, 1000000000).OrderBy(i => rand.Next());
Stephen Quan
  • 21,481
  • 4
  • 88
  • 75
0

What if you created the list in a sorted but still random fashion (such as adding a random number to the last element of the list as the next element), and then shuffled the list with a Fisher-Yates-Durstenfeld? That would execute in linear time overall, which is pretty much as good as it gets for list generation. However, it might have some significant bias that would affect the distribution.

KeithS
  • 70,210
  • 21
  • 112
  • 164