
My question's title is similar to this link; however, that one wasn't answered to my expectations.

I have an array of integers (1,000,000 entries) and need to mask exactly 30% of the elements. My approach is to loop over the elements and roll a die for each one. Doing it in a single uninterrupted pass is good for cache locality.

As soon as I notice that exactly 300,000 elements have been masked, I need to stop. However, I might reach the end of the array with only 200,000 elements masked, forcing me to loop a second time, maybe even a third, etc.


What's the most efficient way to ensure I won't have to loop a second time, while not being biased towards picking some elements over others?

Edit:

I need to preserve the order of elements. For instance, I might have:

 [12, 14, 1, 24, 5, 8]

Masking away 30% might give me:

 [0,  14, 1, 24, 0, 8]

The result of masking must be the original array, with some elements set to zero.

Kari
  • Start with those 1000000 entries. Pick one at random. That's the first one of your 300000. Remove it from the array. 999999 entries left. Pick a randomly chosen one from the remaining entries. That's the second one of your 300000. Repeat the same exact process 299998 more times. Very simple. The end result is you'll end up with exactly 300000 randomly chosen elements. Now, "mask" them, whatever that means. You're done. – Sam Varshavchik Feb 28 '18 at 03:03
  • Or, you could randomly order it, then pick the first 300,000. – gsquaredxc Feb 28 '18 at 03:06
  • If you're going to walk the array from the start, you need to change the probability that each element will be "masked" as you mask (or not) the previous ones. The probability would be number-of-elements-remaining-to-mask / number-of-elements-remaining (see the sketch after these comments). – 1201ProgramAlarm Feb 28 '18 at 03:08
  • I'd probably shuffle and lop off the end if you don't need them in the original order. Otherwise, what @SamVarshavchik said. – Retired Ninja Feb 28 '18 at 03:08
  • For @SamVarshavchik's example, you could just run `for(somecode){ rand()%(300000-i) }` – gsquaredxc Feb 28 '18 at 03:09
  • @SamVarshavchik I would want to avoid this, as it requires me to prepare a clone array so I can remove elements from it. By masking I mean setting to zero. I also need to preserve the order. – Kari Feb 28 '18 at 03:20
  • @1201ProgramAlarm Good approach, except I don't want to divide; it's a slow operation that might take me several extra microseconds on huge arrays. – Kari Feb 28 '18 at 03:21
  • @GrantGarrison I need to preserve the order of elements, so shuffling is not an option. Unless I shuffle a collection of their indices, then select the first 300,000. But this creates unnecessary allocations & an extra pre-pass (shuffling). – Kari Feb 28 '18 at 03:22
  • @Kari. One million is a tiny number (8MB worth of doubles). You keep talking about all sorts of premature optimizations but it's clear you haven't actually tried running any code. Are you really worried if your code takes half a second to run instead of 0.45sec? – Mad Physicist Feb 28 '18 at 03:28
  • I am interested in the implementation, so 0.05 seconds is important in this case. Also, why are people downvoting the question? The issue was explained, a reference to a similar post was provided, and examples were given. Is it just for personal reasons, or because I have 16 rep? – Kari Feb 28 '18 at 03:33
  • Kari, people are unlikely to downvote you for personal reasons since they almost certainly don't know you and you haven't done anything offensive. Also unlikely for having low rep, it's usually the opposite since "newbies" are given some leeway. Much more likely is that the question is considered "not useful" in that it won't help others in future - you provided little reasoning as to why micro-optimising matters in this case - are you *likely* to be doing this so many times per second that it matters? :-) – paxdiablo Feb 28 '18 at 03:48
  • Point is, did you try an implementation and find it inadequate, or are you just worried? – Mad Physicist Feb 28 '18 at 06:26
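
For what it's worth, here is a minimal sketch of the single-pass approach 1201ProgramAlarm describes above (essentially selection sampling): walk the array in order and mask each element with probability remaining-to-mask / remaining-to-inspect. The function name mask_exact, the use of std::mt19937, and the int element type are assumptions for illustration, not part of the original discussion.

#include <cstddef>
#include <random>

// Mask exactly `to_mask` of the `n` elements, chosen without bias,
// in a single sequential pass that preserves element order.
void mask_exact(int *array, std::size_t n, std::size_t to_mask)
{
    std::mt19937 gen(std::random_device{}());
    std::size_t remaining = n;                 // elements not yet inspected
    for (std::size_t i = 0; i < n && to_mask > 0; ++i, --remaining)
    {
        // mask this element with probability to_mask / remaining
        if (std::uniform_int_distribution<std::size_t>(0, remaining - 1)(gen) < to_mask)
        {
            array[i] = 0;
            --to_mask;
        }
    }
}

Because the remaining quota can never exceed the number of elements left to inspect, the loop always finishes with exactly the requested count masked, so no second pass is ever needed.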

4 Answers


Just do a Fisher-Yates shuffle but stop after only 300,000 iterations. The last 300,000 elements will be the randomly chosen ones.

std::size_t size = 1000000;
for(std::size_t i = 0; i < 300000; ++i)
{
    std::size_t r = std::rand() % size;
    std::swap(array[r], array[size-1]);
    --size;
}

I'm using std::rand for brevity. Obviously you want to use something better.
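
For instance, the same partial shuffle with <random> might look like this (a sketch; the function name and the choice of std::mt19937 are mine, and the array is assumed to hold at least `size` elements):

#include <cstddef>
#include <random>
#include <utility>

// Partial Fisher-Yates: after the loop, the last `picks` positions
// hold a uniformly random subset of the original elements.
void partial_fisher_yates(int *array, std::size_t size, std::size_t picks)
{
    std::mt19937 gen(std::random_device{}());
    for (std::size_t i = 0; i < picks; ++i)
    {
        // uniform in [0, size), avoiding the modulo bias of rand() % size
        std::size_t r = std::uniform_int_distribution<std::size_t>(0, size - 1)(gen);
        std::swap(array[r], array[size - 1]);
        --size;
    }
}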


The other way is this:

for(std::size_t i = 0; i < 300000;)
{
    std::size_t r = rand() % 1000000;
    if(array[r] != 0)
    {
        array[r] = 0;
        ++i;
    }
}

This has no bias and does not reorder elements, but it is inferior to Fisher-Yates, especially for high percentages: as the masked fraction grows, more and more random picks land on already-masked entries and are wasted.

Pubby
  • I need to preserve the order of elements, so shuffling is not an option. Unless I shuffle a collection of their indices, then select the first 300,000. But this creates unnecessary allocations & an extra O(n) pre-pass (shuffling) – Kari Feb 28 '18 at 03:26
  • The result of masking must be the original array, with some elements set to zero – Kari Feb 28 '18 at 03:27
  • @Kari I posted an alternative, but I don't think there's much else to be done if you can't change your data structure. – Pubby Feb 28 '18 at 03:35
  • I have to agree that Fisher-Yates (with either indexes or a copy of the array if the original must be left untouched) is preferable. The percentage doesn't have to be *that* high for problems to start occurring, see the birthday problem for why. – paxdiablo Feb 28 '18 at 03:39

When I see a massive list, my mind always goes first to divide-and-conquer.

I won't be writing out a fully-fleshed algorithm here, just a skeleton. You seem like you have enough of a clue to take a decent idea and run with it. I think I only need to point you in the right direction. With that said...

We'd need an RNG that can return a suitably-distributed value for how many masked values could potentially be below a given cut point in the list. I'll use the halfway point of the list for said cut. Some statistician can probably set you up with the right RNG function. (Anyone?) I don't want to assume it's just uniformly random [0..mask_count), but it might be.

Given that, you might do something like this:

// the magic RNG your stats homework will provide
int random_split_sub_count_lo( int count, int sub_count, int split_point );

void mask_random_sublist( int *list, int list_count, int sub_count )
{
    if (list_count > SOME_SMALL_THRESHOLD)
    {
        int list_count_lo = list_count / 2;               // arbitrary
        int list_count_hi = list_count - list_count_lo;

        int sub_count_lo = random_split_sub_count_lo( list_count, sub_count, list_count_lo );
        int sub_count_hi = sub_count - sub_count_lo;

        mask_random_sublist( list,                 list_count_lo, sub_count_lo );
        mask_random_sublist( list + list_count_lo, list_count_hi, sub_count_hi );
    }
    else
    {
        // insert here some simple/obvious/naive implementation that
        // would be ludicrous to use on a massive list due to complexity,
        // but which works great on very small lists.  I'm assuming you 
        // can do this part yourself.
    }
}

Assuming you can find someone more informed on statistical distributions than I to provide you with a lead on the randomizer you need to split the sublist count, this should give you O(n) performance, with 'n' being the number of masked entries. Also, since the recursion is set up to traverse the actual physical array in constantly-ascending-index order, cache usage should be as optimal as it's gonna get.
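
For reference, the distribution being asked for here is the hypergeometric distribution: of the sub_count masked slots chosen without replacement from count positions, how many fall among the first split_point. The standard <random> header has no hypergeometric distribution, so the sketch below (with the signature assumed by the skeleton above) simulates it by sequential draws without replacement; exact, though not the fastest possible sampler.

#include <random>

// Hypergeometric draw: of `sub_count` positions picked without replacement
// from `count` positions, how many fall within the first `split_point`?
int random_split_sub_count_lo( int count, int sub_count, int split_point )
{
    static std::mt19937 gen(std::random_device{}());
    int lo_left = split_point;      // unpicked positions in the lower part
    int total_left = count;         // unpicked positions overall
    int in_lo = 0;
    for (int i = 0; i < sub_count; ++i)
    {
        // this pick lands in the lower part with probability lo_left / total_left
        if (std::uniform_int_distribution<int>(0, total_left - 1)(gen) < lo_left)
        {
            --lo_left;
            ++in_lo;
        }
        --total_left;
    }
    return in_lo;
}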


Caveat: There may be minor distribution issues due to the discrete nature of the list versus the 30% fraction as you recurse down and down to smaller list sizes. In practice, I suspect this may not matter much, but whatever person this solution is meant for may not be satisfied that the random distribution is truly uniform when viewed under the microscope. YMMV, I guess.

Aiken Drum

Here's one suggestion. One million bits is only roughly 125 KB, which is not an onerous amount.

So create a bit array with all items initialised to zero. Then randomly select 300,000 of them (accounting for duplicates, of course) and mark those bits as one.

Then you can run through the bit array and, for any bit that is set to one (or zero, if your idea of masking means you want to process the other 700,000), do whatever action you wish to the corresponding entry in the original array.
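
A minimal sketch of that idea, assuming masking means setting the chosen entries to zero; std::vector<bool> stands in for the bit array, and the function name is mine:

#include <cstddef>
#include <random>
#include <vector>

void mask_via_bitmap(int *array, std::size_t n, std::size_t to_mask)
{
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);

    std::vector<bool> chosen(n, false);    // one bit per element
    for (std::size_t marked = 0; marked < to_mask; )
    {
        std::size_t r = pick(gen);
        if (!chosen[r])                    // account for duplicates
        {
            chosen[r] = true;
            ++marked;
        }
    }

    for (std::size_t i = 0; i < n; ++i)    // sequential pass over the original array
        if (chosen[i])
            array[i] = 0;
}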


If you want to ensure there's no possibility of duplicates when randomly selecting them, just trade off space for time by using a Fisher-Yates shuffle.

Construct a collection of all the indices and, for each of the 300,000 you want selected (or 700,000 if, as mentioned, masking means you want to process the other ones):

  • pick one at random from the remaining set.
  • copy the final element over the one selected.
  • reduce the set size.

This will leave you with a random subset of indices that you can use to process the integers in the main array.
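
A sketch of that index-based Fisher-Yates selection, again with a hypothetical name and with "processing" taken to mean setting the entry to zero:

#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

void mask_via_index_selection(int *array, std::size_t n, std::size_t to_mask)
{
    std::mt19937 gen(std::random_device{}());

    std::vector<std::size_t> indices(n);
    std::iota(indices.begin(), indices.end(), 0);   // 0, 1, ..., n-1

    std::size_t remaining = n;
    for (std::size_t i = 0; i < to_mask; ++i)
    {
        // pick one at random from the remaining set
        std::size_t r = std::uniform_int_distribution<std::size_t>(0, remaining - 1)(gen);
        array[indices[r]] = 0;                      // process the selected entry
        indices[r] = indices[remaining - 1];        // copy the final element over the one selected
        --remaining;                                // reduce the set size
    }
}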

paxdiablo
  • Good, but randomly selecting 300,000 of them is the part I need to figure out - it has to be efficient, with no duplicate indices – Kari Feb 28 '18 at 03:19
  • @Kari, then you need a Fisher-Yates shuffle. Have updated answer to show how to do that. – paxdiablo Feb 28 '18 at 03:30

You want reservoir sampling. Sample code courtesy of Wikipedia:

(*
  S has items to sample, R will contain the result
 *)
ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]

  // replace elements with gradually decreasing probability
  for i = k+1 to n
    j := random(1, i)   // important: inclusive range
    if j <= k
        R[j] := S[i]
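
One way to apply this here, sketched with names of my own: reservoir-sample 300,000 indices rather than values, then zero those positions, so the original array keeps its order.

#include <cstddef>
#include <random>
#include <vector>

void mask_via_reservoir(int *array, std::size_t n, std::size_t k)
{
    std::mt19937 gen(std::random_device{}());

    // fill the reservoir with the first k indices
    std::vector<std::size_t> reservoir(k);
    for (std::size_t i = 0; i < k; ++i)
        reservoir[i] = i;

    // replace reservoir entries with gradually decreasing probability
    for (std::size_t i = k; i < n; ++i)
    {
        std::size_t j = std::uniform_int_distribution<std::size_t>(0, i)(gen);  // inclusive range
        if (j < k)
            reservoir[j] = i;
    }

    // apply the mask in place; element order is untouched
    for (std::size_t idx : reservoir)
        array[idx] = 0;
}
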
Stephen Newell
  • Thank you, but it will result in many cache misses & O(k^2) complexity – Kari Feb 28 '18 at 04:04
  • Am I missing something? Performance would be O(N) to build the reservoir, and O(N) to operate on the data in the reservoir. O(2N) isn't ideal, but you said you needed to remove exactly 300,000 elements and this does so in a fair manner. – Stephen Newell Feb 28 '18 at 04:14
  • You are right, this would be linear complexity; it's a good solution, but not cache-friendly – Kari Feb 28 '18 at 04:24