
For example...

  1. I have an array of integers initialized with random values between 1 and 1000
  2. The array has 1M elements (it will probably have many more, but this is only an example)
  3. The number of occurrences of each element must be between 10 and 1010

What's the fastest way of adjusting this array's elements so that they meet the criteria above?

My first solution is simply too slow when the maximum allowed number of occurrences is close to array.Length (1M) / valuesSpan (1000).

I tried something like this (it only enforces the maximum number of occurrences; the solution for the lower limit is almost the same):

Int64[] DistinctArrayElements = distinctArrayElements;
Dictionary<Int64, Int32> occurrences = new Dictionary<Int64, Int32>();

foreach (Int64 DistinctElement in DistinctArrayElements)
{
    occurrences.Add(DistinctElement, 0);
}

foreach (Int64 ArrayElement in Arr)
{
    occurrences[ArrayElement] += 1;
}
//I know this initialization can be done more nicely, so don't bother with this.

for (int j = 0; j < Arr.Length; j++)
{
    if (occurrences[Arr[j]] > upperNoOfOccurrences)
    {
        for (int i = 0; i < Arr.Length; i++)
        {
            if (occurrences[Arr[i]] < upperNoOfOccurrences)
            {
                occurrences[Arr[j]] -= 1; // decrement the old value before overwriting it
                Arr[j] = Arr[i];
                occurrences[Arr[i]] += 1;
                break;                    // one replacement is enough for this position
            }
        }
    }
}
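(As an aside, the counting step itself could also be written with LINQ; this is just a sketch of the same dictionary, assuming Arr is the array and System.Linq is available:)

using System.Linq;

// same occurrence counts as above, built in one statement
Dictionary<Int64, Int32> occurrences = Arr
    .GroupBy(value => value)
    .ToDictionary(group => group.Key, group => group.Count());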
Eedoh
  • What will you do if your array is larger than 1000*1010? Then at least one of your thousand unique elements will occur more than 1010 times. – Kevin Oct 25 '11 at 13:59
  • That will be handled on the input form by data validation. That situation will not be possible. – Eedoh Oct 25 '11 at 14:00
  • What do you mean by "input form by data validation" exactly? – Yahia Oct 25 '11 at 14:29
  • I will be validating data on my input form, so the situation Kevin is referring to is not gonna be possible. – Eedoh Oct 25 '11 at 14:50

3 Answers


I can't quite make out what you want to do exactly. However, it seems a waste to process the array this many times. You could do it all in a single pass (plus a little bit of forward peeking when you run out of "free" unique values). The following is certainly not the nicest code I have written, but I think it solves your problem.

HashSet<long> forbidden = new HashSet<long>(); // maximum size of 1000, contains values that exceeded the limit
Queue<long> remaining = new Queue<long>(1000); // stores unique values that are still within the limit; used when we hit the limit
Dictionary<long, int> frequencies = new Dictionary<long, int>(1000);
int lastPeekIndex = 0;
for (int i = 0; i < Arr.Length; i++) {
  if (!frequencies.ContainsKey(Arr[i])) {
    frequencies[Arr[i]] = 0;
    remaining.Enqueue(Arr[i]); // Queue<T> uses Enqueue, not Add
  }

  if (frequencies[Arr[i]] == upperLimit) {
    if (!forbidden.Contains(Arr[i])) forbidden.Add(Arr[i]);
    var next = Int64.MinValue;
    try {
      next = remaining.Dequeue();
      while (forbidden.Contains(next)) next = remaining.Dequeue();
    } catch (InvalidOperationException) { // Arrr! we have not yet observed enough unique values
      for (int j = Math.Max(i, lastPeekIndex) + 1; j < Arr.Length; j++)
        if (!frequencies.ContainsKey(Arr[j])) {
          frequencies[Arr[j]] = 0;
          next = Arr[j];
          lastPeekIndex = j;
          break; // stop at the first not-yet-seen value instead of scanning to the end
        }
    }
    Arr[i] = next;
    frequencies[next]++;
    if (frequencies[next] < upperLimit) remaining.Enqueue(next);
  } else frequencies[Arr[i]]++;
}

Note that this does not check the lower limit, since your code didn't check it either. I think you have to take care of values that do not occur often enough in a second pass. You can put them in another queue after the first pass and then iterate over the array again until the queue is empty (probably even less than one full iteration will be needed in the second pass).
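Something like this could serve as that second pass (only a rough sketch; it reuses the frequencies dictionary from above and assumes lowerLimit is the minimum allowed number of occurrences):

Queue<long> needy = new Queue<long>();
foreach (var pair in frequencies)
  for (int k = pair.Value; k < lowerLimit; k++)
    needy.Enqueue(pair.Key); // one queue entry per missing occurrence

for (int i = 0; i < Arr.Length && needy.Count > 0; i++) {
  // only take elements away from values that can spare one
  if (frequencies[Arr[i]] > lowerLimit) {
    long replacement = needy.Dequeue();
    frequencies[Arr[i]]--;
    Arr[i] = replacement;
    frequencies[replacement]++;
  }
}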

Andreas

I would sort your dictionary so that the numbers which appear least often come first. That way you don't have to search for a suitable number every time; you just substitute with the one that has the fewest occurrences. Here's some pseudocode to explain that:

struct dict {
    key, value
}

linkedList<dict> occurrences;

initialize occurrences
sort it (smallest values first)

// start from the one with greatest number of occurrences
n = occurrences.last;

// keep going until occurrences of n is greater than upperNoOfOccurrences
while n.value.value > upperNoOfOccurrences and didn't reach first element
    repeat = true

    do:
        // required occurrences to subtract to be within the limit
        required = n.value.value - upperNoOfOccurrences

        // maximum occurrences we can add to the first
        maxAllowed = upperNoOfOccurrences - occurrences.first.value.value

        // if we can do that
        if required < maxAllowed:
            occurrences.first.value.value += required
            n.value.value -= required
            repeat = false
        else:    // n.value.value is still greater than upperNoOfOccurrences
            occurrences.first.value.value += maxAllowed 
            n.value.value -= maxAllowed 
            repeat = true
        end if

        // keep occurrences sorted
        newPos = occurrences.first.next
        while occurrences.first.value.value > newPos.value.value:
            newPos = newPos.next

        move occurrences.first before newPos
    while repeat
end while

Now rebuild your array from occurrences. It will come out sorted (grouped by value), but that doesn't matter, does it? ;)
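In C# the same idea could look roughly like this (a sketch only; it reuses the occurrences dictionary, Arr, and upperNoOfOccurrences from your question, uses two indices into a sorted array instead of re-sorting a linked list, and, like the pseudocode above, only enforces the upper limit; needs System and System.Linq):

long[] values = occurrences.Keys.OrderBy(v => occurrences[v]).ToArray();
int[] counts = values.Select(v => occurrences[v]).ToArray();

int lo = 0, hi = values.Length - 1;
while (hi >= 0 && counts[hi] > upperNoOfOccurrences)
{
    int surplus = counts[hi] - upperNoOfOccurrences;
    while (surplus > 0 && lo < hi)
    {
        // how many occurrences the currently rarest value can still absorb
        int room = upperNoOfOccurrences - counts[lo];
        int moved = Math.Min(surplus, room);
        counts[lo] += moved;
        counts[hi] -= moved;
        surplus -= moved;
        if (counts[lo] == upperNoOfOccurrences) lo++; // this value is full, move on
    }
    hi--;
}

// rebuild the array from the adjusted counts (grouped by value, as noted above)
int k = 0;
for (int i = 0; i < values.Length; i++)
    for (int c = 0; c < counts[i]; c++)
        Arr[k++] = values[i];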
BlackBear

Here's a simple and exact method that uniformly samples the sets of numbers that meet your criteria.

  1. Let M = number of distinct values; N = number of array elements; L = lower limit on count of instances of each value; U = upper limit on count; D = U-L. For your example, M=1000, N=1000000, L=10, U=1010, and D=1000.
  2. Create array S, of size M*D. Set the first N - M*L entries of S to 1, and the rest to zero.
  3. Shuffle S via a Fisher–Yates shuffle.
  4. Create array T, of size M.
  5. For i up to M, set T[i] = L + S[i*D] + S[i*D+1] + ... + S[i*D+D-1].
  6. Create array V, of size N. Fill it with T[0] instances of the 0'th value and so forth, i.e. with T[i] instances of the i'th value, for each i. Because S contained N - M*L ones, the counts in T sum to exactly N, so V will be filled completely and exactly.
  7. Shuffle V via Fisher–Yates shuffle. Then array V satisfies the original criteria.

Note that steps 2-5 are O(M*D), while steps 6-7 are O(N+M); the latter is as good as can be achieved, and the former probably likewise, since M*D is O(N) in your problem statement.
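If it helps, a rough C# rendering of those steps (just a sketch; M, N, L, U as defined in step 1, the distinct values taken to be 1..M, and a simple Fisher–Yates helper):

using System;

static void Shuffle(int[] a, Random rng)
{
    // Fisher–Yates: swap each position with a random earlier-or-equal one
    for (int i = a.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
}

static int[] Generate(int M, int N, int L, int U, Random rng)
{
    int D = U - L;

    // steps 2-3: M*D slots, N - M*L of them set to 1, then shuffled
    int[] S = new int[M * D];
    for (int i = 0; i < N - M * L; i++) S[i] = 1;
    Shuffle(S, rng);

    // steps 4-5: the count for value i is L plus the ones in its block of D slots
    int[] T = new int[M];
    for (int i = 0; i < M; i++)
    {
        T[i] = L;
        for (int d = 0; d < D; d++) T[i] += S[i * D + d];
    }

    // step 6: emit T[i] copies of the i'th value (here simply i + 1)
    int[] V = new int[N];
    int k = 0;
    for (int i = 0; i < M; i++)
        for (int c = 0; c < T[i]; c++)
            V[k++] = i + 1;

    // step 7: shuffle the filled array
    Shuffle(V, rng);
    return V;
}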

James Waldby - jwpat7