
Given an unsorted binary array, `a`, the only allowed operation is `all_zeros(a)`, which returns True iff all of the array's elements are 0.
The complexity of this `all_zeros(a)` is O(len(a)) plus a large overhead constant.

I would like to find all the indices which contain 1s, in as few runs of `all_zeros` as possible. A reasonable sub problem is to assume the number of 1s is "much" (say, x100~x1000) smaller than the number of 0s.


Theoretically, this is simply solved by iterating over the array element-wise and testing `all_zeros([element])`.
In practice, the overhead constant forces us to work in batches as large as possible. We can't assume to know the ratio of 1's in the array, but if some algorithm requires that knowledge, please do share it.
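
For concreteness, here is a minimal Python sketch of that element-wise baseline, assuming the `all_zeros` predicate above is available (`find_ones_naive` is just an illustrative name):

    def find_ones_naive(a):
        # One all_zeros call per element: correct, but pays the large
        # overhead constant len(a) times.
        return [i for i in range(len(a)) if not all_zeros(a[i:i+1])]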

I am looking for a conceptual solution, so I am not specifying the ratio between the overhead constant and the computation time of `all_zeros`.

Please note I am looking for an average-case solution, not a worst-case solution.
This requires defining a probability distribution over 1's and 0's, but I am trying to keep this at a high level; I will not go deeply into the details, while still keeping this answerable.
There may be a best-case solution that always achieves the minimum overhead. If there is one, it will be accepted.

Gulzar
  • If all zeroes happen to be isolated, like in 11010101101011011010101, you cannot do better than to call `all_zero` on every individual digit. – trincot Jun 30 '22 at 10:57
  • @trincot While this is true, the example violates the problem constraint that the array is (very) sparse in `1` elements. – collapsar Jun 30 '22 at 12:02
  • 1
  • @collapsar, *"violates"* is a very strong word for something that is presented as *"A reasonable sub problem"*, even more so when the asker states *"We can't assume to know the ratio of 1's in the array"*. – trincot Jun 30 '22 at 12:10
  • @trincot While both your comments are true, this is "the trivial case", which is also statistically very unreasonable, even assuming a uniform distribution on [0, 1]. But more so, I think it is clear this is not what the question is about. Let's add the assumption that there are always 10x fewer 1's than 0's. – Gulzar Jun 30 '22 at 13:16
  • Can you edit your question? Because that assumption contradicts what you wrote in the question (*"We can't assume..."*). – trincot Jun 30 '22 at 13:49
  • @trincot Indeed, *we can't assume*. Still, this specific worst case contradicts the very existence of this question, which clearly doesn't discuss it. It is an edge case which is not likely. We are looking for *the average case*. That I will add. – Gulzar Jun 30 '22 at 13:56

2 Answers


I would check big chunks and only try smaller ones if they are not all zero. Depending on the ratio of 1s and the 'large overhead constant', I would choose a suitable starting size.

Here is the idea of how to check (by example):

The data: (spaces only for readability)

   00001110 00100001 00100000 01000000 00000000 00000000 00000101 01010000
1. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
   -> both checked intervals are non-zero -> halve them

2. xxxxxxxxxxxxxxxxx XXXXXXXXXXXXXXXXX xxxxxxxxxxxxxxxxx XXXXXXXXXXXXXXXXX
      non-zero           non-zero          zero              non-zero

3. xxxxxxxx XXXXXXXX xxxxxxxx XXXXXXXX                   xxxxxxxx XXXXXXXX
     n-z      n-z      n-z      n-z                        n-z      n-z   

4. xxxxXXXX xxxxXXXX xxxxXXXX xxxxXXXX                   xxxxXXXX xxxxXXXX
   zero n-z n-z n-z  n-z zero n-z zero                   zero n-z n-z zero

5.     xxXX xxXXxxXX xxXX     xxXX                           xxXX xxXX 

...

I hope the idea is clear. But I highly recommend profiling which block size to start with and when to switch to single-element blocks.
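
A minimal Python sketch of that recursive halving, assuming the `all_zeros(a)` predicate from the question; `start_size` is an illustrative parameter standing in for the profiled starting block size:

    def find_ones_chunked(a, start_size=64):
        # Check the array in blocks of start_size; split any non-zero
        # block in half until single elements are reached.
        ones = []

        def check(lo, hi):
            if lo >= hi or all_zeros(a[lo:hi]):
                return                 # block is empty or all zeros
            if hi - lo == 1:
                ones.append(lo)        # a lone non-zero element is a 1
                return
            mid = (lo + hi) // 2
            check(lo, mid)             # left half
            check(mid, hi)             # right half

        for start in range(0, len(a), start_size):
            check(start, min(start + start_size, len(a)))
        return ones

Note that this sketch always halves down to single elements instead of switching to per-element checks early; that switch-over point is exactly the kind of threshold worth profiling.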

MrSmith42

If `all_zeros` returns false for some subarray, then you can binary search within that subarray to find the position of the first 1. This process tells you nothing about any elements following that 1, so you would start again after that.

The question is what size to make your initial queries. You will make the fewest queries in total if the probability of each query returning true is 50%. If your initial query has a 50% chance of finding a 1, then all the queries in the binary search will also have a 50% chance, and the total cost per 1 is log2 L + 1 queries, if 1s are L slots apart on average.

If L is twice as long as it should be, or half as long as it should be, then the cost goes up by about 1 query per 1, which is a pretty small price to pay when 1s are far apart.
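
As a rough worked instance of that estimate: if 1s are about 128 slots apart and you pick L = 128, each 1 costs about log2 128 + 1 = 8 queries; mis-estimating L as 64 or 256 raises that to roughly 9.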

So a pretty good algorithm that doesn't require knowing the frequency of 1s to start with would be:

  1. Set L=128, say. This is an a priori estimate of 1 frequency.
  2. Check the first L elements. If they are all zero, then multiply L by 2 and continue with the rest of the array.
  3. Otherwise, divide L by 2 if it's > 1, binary search to find the position of the first 1, and continue with the rest of the array after that first 1.

The total cost will be log2 L + some_small_number of queries per 1, if the 1s are randomly distributed, and I think that's the worst case.
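
A minimal Python sketch of this adaptive scheme, assuming the `all_zeros(a)` predicate from the question; the default L=128 is the a-priori guess from step 1, and `find_ones_adaptive` is just an illustrative name:

    def find_ones_adaptive(a, L=128):
        ones = []
        i, n = 0, len(a)
        while i < n:
            j = min(i + L, n)
            if all_zeros(a[i:j]):
                i = j                  # step 2: block is clean, move on
                L *= 2                 #         and try bigger blocks next
            else:
                if L > 1:
                    L //= 2            # step 3: shrink the block-size estimate
                lo, hi = i, j          # binary search for the first 1 in a[i:j]
                while hi - lo > 1:
                    mid = (lo + hi) // 2
                    if all_zeros(a[lo:mid]):
                        lo = mid       # first 1 is in the right half
                    else:
                        hi = mid       # first 1 is in the left half
                ones.append(lo)
                i = lo + 1             # continue right after that 1
        return ones

For example, `find_ones_adaptive([0]*1000 + [1] + [0]*1000)` returns `[1000]` using far fewer `all_zeros` calls than the 2001 element-wise checks of the naive approach.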

Matt Timmermans