1

From a start and end where both data types are long's, I'd like to produce a randomly sorted list with them.

At the moment, I'm using a for loop to populate a list:

for (var i = idStart; i < idEnd; i++){ list.Add(i); }

Then I'm shuffle'ing the list using an extension method. However, when the difference between start and end are large (millions), the for loop causes out of memory exceptions.

Is there a more efficient, sleeker method for producing an randomly sequenced list of long's, where each number only appears once?

user3791372
  • 4,445
  • 6
  • 44
  • 78
  • 2
    How are you shuffling your list now that's causing the out of memory error? – Ben Rubin Nov 24 '15 at 20:31
  • 2
    Most shuffles (e.g. https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) can be done in place. There shouldn't be any memory problems. – itsme86 Nov 24 '15 at 20:34
  • @BenRubin It isn't the shuffle that's causing the out of memory exception, but the adding of the longs to a list. If I were to use an array no exception occurs – user3791372 Nov 24 '15 at 20:35
  • 1
    What means "large"? Please show your code... – Shnugo Nov 24 '15 at 20:37
  • If I understand you correctly, you want to shuffle ALL integer numbers between any two long-values? Maybe you should describe a little more what you are really trying to achieve... – Shnugo Nov 24 '15 at 20:40
  • @user3791372 There's no reason why you should be running out of memory while creating the list unless your list is hundreds of millions of elements (infinite loop maybe). It would help if you show your code for creating the list. – Ben Rubin Nov 24 '15 at 20:40
  • Maybe straight forward, but maybe an typical XY-problem (http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – Shnugo Nov 24 '15 at 20:42
  • @BenRubin It is hundreds of millions! Far from inifinite though. – user3791372 Nov 24 '15 at 20:43
  • 1 Int32 = 4 bytes. 100 million Int32s = ~400Mb. There's no reason you should be running out of memory unless you're running this on a pocket calculator. Show your code. – Colin Grealy Nov 24 '15 at 20:45
  • 1
    @ColinGrealy He said they're `long`s, which are 8 bytes. – itsme86 Nov 24 '15 at 20:46
  • Do you need a "real" or at least ("close to real") random shuffle? Do you really need all numbers or just a subset? – Shnugo Nov 24 '15 at 20:47
  • @itsme86 oops, you're right. Still shouldn't be running out of memory though. – Colin Grealy Nov 24 '15 at 20:48
  • @user3791372 Have you considered partitioning your numbers? You could, say, put up to a hundred thousand numbers in one list, and then overflow into the next list, and so on. Then the shuffle would have to exchange between lists as well as indices within the lists, but you could get around the 2MB single object limit that way. – itsme86 Nov 24 '15 at 20:48
  • @user3791372 Well, memory is finite. At 8 bytes per `long`, you're getting into gigabytes of memory if you have hundreds of millions of elements. You could store your entire list on disk, and load portions of it into memory as you sort it. – Ben Rubin Nov 24 '15 at 20:49
  • 2
    More importantly @user3791372, what are you trying to achieve with this list? There's almost certainly a better way to do this. – Colin Grealy Nov 24 '15 at 20:53
  • Reading all this, I come to the following: buy RAM :-) – Shnugo Nov 24 '15 at 22:37
  • Does the list have to be *actually* random or only *appear* random? There are lots of ways to make a sequence of n large numbers that repeats only after all n numbers are used up that are not random and take a small amount of storage but they are not truly random. – Eric Lippert Nov 25 '15 at 06:27

1 Answers1

5

Is there a more efficient, sleeker method for producing an randomly sequenced list of long's, where each number only appears once?

Yes, if you eliminate the requirement that the sequence be truly random. Use the following technique.

Without loss of generality let us suppose that you wish to generate numbers from 0 through n-1 for some n. Clearly you can see how to generate numbers between x and y; just generate numbers from 0 through x-y and then add x to each.

Find a randomly generated number z that is coprime to n. Doing so is left as an exercise to the reader. It will help if the number is pretty large modulo n; the pattern will be easy to notice if z is small modulo n.

Find a randomly generated number m that is between 0 and n-1.

Now generate the sequence (m) * z % n, (m + 1) * z % n, (m + 2) * z % n, and so on. The sequence repeats at (m + n) * z % n; it does not repeat before that. Again, determining why it does not repeat is left as an exercise.

It is easy to see that this is not a true shuffle because there are fewer than n squared possible sequences generated, not the n factorial sequences that are possible with a true shuffle. But it might be good enough for your purposes; if you are using something like System.Random to do randomization you are already abandoning a true shuffle.

I note also that many of the comments suggest that there should be no problem with a large allocation. These comments forget that (1) the relevant measure is not amount of RAM in the box but rather size of the largest contiguous user mode address space block, and that can easily be less than a hundred million bytes in a 32 bit process, (2) that the list data structure intentionally over-allocates, that (3) when the list gets full a copy of the underlying array must be allocated to copy the old list into the new list, which more than doubles the actual memory load of the list, temporarily, and that (4) a user who naively attempts to allocate one hundred-million-byte structure may well attempt to allocate a dozen of them throughout the program. You should always avoid such large allocations; if you have data structures that require large amounts of storage then put them on disk.

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • Your solution is effective as long as the generated sequence is not fully stored in memory (i.e. an `IEnumerable` generated through a `yield return`). The original problem still exists when you try to have a list based on this sequence. It is not clear if the OP requires just a sequence or a full list. – Luca Cremonesi Nov 25 '15 at 16:56
  • 1
    @LucaCremonesi: Since every item in the sequence is determined from its ordinal there is no need to keep the entire list in memory. Make an implementation of `IList` whose indexer getter simply calculates the value on the fly when asked. – Eric Lippert Nov 25 '15 at 17:45
  • However this `IList` implementation does not support the full contract, in particular the setter of the Item property, unless you create an internal `Dictionary` to store the new values. – Luca Cremonesi Nov 25 '15 at 18:24
  • @LucaCremonesi: Throw an InvalidOperationException, it is a read-only list, I mean, the IsReadOnly property should return true. – EduardoS Nov 25 '15 at 19:50
  • User address space isn't the only limitation. Even with pre-allocation, x64 can't allocate more than 268M (2^28) due to `OutOfMemoryException: Array dimensions exceeded supported range.` – Brian Nov 25 '15 at 21:40
  • @EduardoS: IsReadOnly specifies if it is possible to add or remove items to/from a Collection. It does not specify if the collection is immutable or not, i.e. if it is possible to mutate an item. From the docs: "For example, the IsReadOnly property of an array that is cast or converted to an `ICollection` object returns true, even though individual array elements can be modified." Instead, I was referring to the indexer setter of the `IList` interface, that allows to mutate the items of the list. – Luca Cremonesi Nov 25 '15 at 21:45
  • @LucaCremonesi: I would have no problem whatsoever with making an `IList` implementation whose setter threw an exception. It would quickly teach those who attempt to set the values that it hurts when they do that, and they'd stop doing that. If that's not acceptable to you then maybe invent a new interface, maybe call it `IReadOnlyList`, and make a class that implements that instead. Regardless, this detail is a distraction from the actual problem which is: does the original poster require true randomness? If so then we have a much harder problem to solve. – Eric Lippert Nov 25 '15 at 21:49
  • @EricLippert: "does the original poster require true randomness? If so then we have a much harder problem to solve." Just out of curiosity, in which direction would you proceed to start solving this harder problem? A Cloud Service that returns the true random sequence? Or a Database? – Luca Cremonesi Nov 25 '15 at 22:16
  • Very interesting solution - thanks. it doesn't have to be random, just as long as I have an even spread across the numbers. – user3791372 Nov 27 '15 at 13:47