13

This question was inspired by an answer I was working on yesterday.

Let's say we have N inputs that evaluate to either true or false. What is the most efficient way to determine whether X of those inputs are true?

Caveats:

  1. The inputs are not in an array, so if you convert them to an array please account for any overhead costs.
  2. By "most efficient" I mean in terms of best average case (although I would love to see best and worst case stats too).

Here are two of the methods I came across yesterday.

1) Think of the variables as Boolean inputs for a circuit and reduce them using a K-map

At first I thought this would be the most efficient approach because it follows circuit logic, but I've definitely had second thoughts. As the number of inputs increases, the number of comparisons goes up exponentially:

2 inputs:
   1 of 2: if(1 OR 2)
   2 of 2: if(1 AND 2)

3 inputs:
   1 of 3: if(1 OR 2 OR 3)
   2 of 3: if((1 AND 2) OR (1 AND 3) OR (2 AND 3))
   3 of 3: if(1 AND 2 AND 3)

4 inputs:
   1 of 4: if(1 OR 2 OR 3 OR 4)
   2 of 4: if((1 AND 2) OR (1 AND 3) OR (1 AND 4) OR (2 AND 3) OR (2 AND 4) OR (3 AND 4))
   3 of 4: if((1 AND 2 AND 3) OR (1 AND 2 AND 4) OR (1 AND 3 AND 4) OR (2 AND 3 AND 4))
   4 of 4: if(1 AND 2 AND 3 AND 4)

... etc. ...

The best case scenario is fine (O(1)), but the worst case is far worse than...

2) A counter and sequential if statements

This performs in O(n) time always, which is OK, but I was hoping for a better best case.

counter = 0

if(input 1)
   counter++

if(input 2)
   counter++

if(input 3)
   counter++

... etc. ...

if(counter >= X)
   // true
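
For reference, here is a minimal sketch of this counter approach in C (the function name `at_least_x` and the four input variables are hypothetical, assuming the inputs are plain boolean flags):

#include <stdbool.h>

/* Hypothetical sketch of the counter approach: test each named input,
   bump a counter, and compare against X at the end. Always O(n). */
static bool at_least_x(bool input1, bool input2, bool input3, bool input4, int x)
{
    int counter = 0;

    if (input1) counter++;
    if (input2) counter++;
    if (input3) counter++;
    if (input4) counter++;

    return counter >= x;
}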

What is a more efficient solution than either of these?

Peter Mortensen
Swift
  • The question that inspired this if anyone is interested: http://stackoverflow.com/q/6687557/385913 – Swift Jul 14 '11 at 15:00
  • 4
    "The inputs are not in an array, so if you convert them to an array please account for any overhead costs. " So... what form ARE they in? Surely they aren't all named variables. If they are I/O ports that can only be accessed one by one, then that's probably more expensive than anything else, and counting is the only way. If they're bits packed in a variable, a world of options arise. – Ed Staub Jul 14 '11 at 15:18
  • In the question that inspired my post they were all named variables, but I'd love to see an answer using a variable packed with bits. – Swift Jul 14 '11 at 15:25
  • 1
    Best algorithm to count the number of set bits in a 32-bit integer? http://stackoverflow.com/questions/109023/best-algorithm-to-count-the-number-of-set-bits-in-a-32-bit-integer – Ed Staub Jul 14 '11 at 15:45
  • The problem is best-case `omega(n)`. Proof: every solution is `omega(X)` and since `X` is `O(n)` no solution will be faster than best-case `omega(n)`. It follows that average running time is `omega(n)`. We conclude that the simple counter solution is optimal. – davin Jul 14 '11 at 15:53
  • @davin I disagree, check out my answer below. If you use nested if statements you can achieve a best case of O(x) and retain a worst case of O(n). The trade-off is readability and code length. – Swift Jul 14 '11 at 16:26
  • Yeah, I really don't like that solution. The modified counter is best-case `O(X)` as well, so you didn't really gain anything, and again, since `X` can be anything up to `n`, that's a best-case `O(n)` in general. There doesn't exist a worst-case sub-linear solution. – davin Jul 14 '11 at 16:32

7 Answers

15

On the problem's complexity
Since an exact count is requested (as opposed to, say, asking if at least X inputs are on), the problem is very clearly O(n):

  • one needs to visit every single input, and,
  • the work for each input is independent from n (while the work for a given input may vary based on the particular value of the input, the amount of work [for this input] doesn't vary if the number of inputs is changed.)

We can certainly implement sub-optimal algorithms which, for example, would [unnecessarily] visit all other inputs as each input is being processed, making it an O(n^2) implementation, but that is of course silly.

This being asserted, the question is therefore probably about...
tricks which would make the implementation faster
It should be noted that while such tricks may exist, the complexity of the algorithm/problem remains stubbornly at O(n).

Trick 1: Better storage for inputs
Unfortunately, the question indicates that the inputs come from named variables and that the cost for any conversion of the input [for the sake of allowing faster counting] would have to be taken into account for the evaluation of the overall performance of the algorithm. Although this eventually depends on the underlying language, runtime etc., the need to account for the conversion cost very likely condemns any algorithm based on alternate storage to be slower than solutions which keep the inputs as-is.

Trick 2: short-circuit the evaluation
The idea is to return false as soon as (or shortly after) either

  • the running count of inputs that are on is bigger than X (or, if we are counting the number of inputs that are off, when this count exceeds (n - X)), or
  • the number of inputs left to test plus the running count of inputs that are on is less than X (or something similar in the case of counting the off inputs).

This trick is relatively straightforward, but the extra cost of computing the values needed in the early-exit tests may offset the gains made by exiting early.
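
For illustration only (this is not mjv's or Chris Acheson's code), here is a C sketch of Trick 2 under the "at least X" reading of the question; the inputs are shown as an array purely for readability, even though per Trick 1 the same tests can be written out against named variables:

#include <stdbool.h>

/* Illustrative sketch: return true as soon as X inputs are known to be on,
   and false as soon as too few inputs remain to ever reach X. */
static bool at_least_x_short_circuit(const bool inputs[], int n, int x)
{
    int on_count = 0;

    for (int i = 0; i < n; i++) {
        if (inputs[i]) {
            on_count++;
            if (on_count >= x)
                return true;                     /* enough inputs are on */
        } else if (on_count + (n - 1 - i) < x) {
            return false;                        /* even if all remaining inputs
                                                    are on, we cannot reach x */
        }
    }
    return false;
}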

Trick 3: use reverse logic: count the number of inputs that are off rather than those that are on (or count both).
The cost/benefit of this approach depends both on the number of on-inputs to test for (the X of the question) and on the statistical prior we may have about the inputs (is the number of on-inputs at a given time relatively uniformly distributed, or do we tend to have only a few inputs on (or off)?).

The solution proposed by Chris Acheson provides a baseline for the use of both Trick 2 and Trick 3. Assuming that we could make a few assumptions about the distribution of the inputs' state, additional performance improvements to this baseline would be driven by such "priors": some quick heuristics done prior to counting the inputs per se would determine which state we should count (on, off, or both), which limit we should test for, etc., and branch to the corresponding "version" of the algorithm.

Additional gains are also possible, if we know the individual probability of a given input to be on or off, as we'd then test for the most (or least) likely ones first, to quickly get to our "short circuit value".

On the best-case/worst-case "complexity" of these optimized algorithms
Assuming that

  • the number of inputs that are on at a given time is uniformly distributed
  • all inputs have a 50% chance of being on at a given time
  • X is randomly selected between 1 and n

A combination of Tricks #2 and #3 could be O(X/2) on average (I need to do the math, but that seems right). However, I think it wiser to talk in terms of the number of operations relative to X and/or n, rather than misusing the O-notation...
Assuming that all operations roughly incur the same cost

  • Initialize a counter or variable
  • Test an input or a variable
  • addition or subtraction
  • etc

It is easier and also more exact to compute the total number of operations needed for a given algorithm to complete, and to use such counts for the various best/worst/average cases to help decide between specific algorithms.
To illustrate this, a naive implementation that merely counts all on-inputs systematically and only compares the counter at the end would be of O(n) complexity and complete in all cases in roughly 1 + 2*n + 1 operations. Such an algorithm may prove to be better overall than a fancier algorithm which, while being, say, O(X), O((X+n)/2) and O(n) in the best, average and worst cases respectively, may well use X*3, (X+n)*1.5, and n*3 operations in these same cases.

mjv
  • 1
    Given the first solution presented by the original author, it's fair to assume that "X or more out of N inputs are true". – RHSeeger Jul 15 '11 at 15:06
  • It's not completely true that in order to check if exactly X inputs are true one needs to check _every_ input. Example: if X == 1, and the first two inputs are true, we may stop further checking. So the _average_ number of checks can be below _N_. For the _worst case_, of course, all the N checks are needed. – Vlad Jul 16 '11 at 14:51
  • @Vlad: It's important to distinguish between average complexity that is dependent on the distribution of inputs, and average complexity that isn't. For example, 3-way quicksort with random pivots is average case O(n*log(n)) for any input, whereas if I feed in a lot of inputs with no true values, the average number of checks your program makes will always be n. – Clueless Jul 16 '11 at 22:20
11

This version will be efficient for values of X close to either zero or N:

true_counter = 0
false_counter = 0
max_false = N - X

if(input 1)
   true_counter++
   if(true_counter >= X)
      return true
else
   false_counter++
   if(false_counter > max_false)
      return false

// and so on
Chris Acheson
  • Well done, Chris, a very generic way of short-circuiting the logic. Attention: I think the question is about finding if there are exactly X inputs on, and therefore your first short circuit test should be `if (counter > X) return false` – mjv Jul 14 '11 at 17:17
  • I went with "X or more inputs", since that's what the OP's example code did. In the case of "exactly X inputs", your way works, and we only return true if we get to the very end. – Chris Acheson Jul 14 '11 at 18:03
4

Counting the trues is the fastest way. Your own counter will allow more than X to be true, but the question implies you want a specific value - not a big deal, but if you want at least 10 (and more is OK), you can check the counter after each increment and abort early.

On another note, if the flags are packed into a word there are faster ways of counting the 1's. In the end, counting 1's is the way to go. BTW, in C false is 0 and true is 1, so you can literally add them together to count the trues.
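
For instance, if the flags are packed into a 32-bit word, a sketch using the GCC/Clang popcount builtin (other compilers and languages offer equivalents; this assumes one bit per input):

#include <stdbool.h>
#include <stdint.h>

/* Sketch: count the set bits of a bitmask in one call and compare to X. */
static bool at_least_x_bits(uint32_t flags, int x)
{
    return __builtin_popcount(flags) >= x;   /* GCC/Clang builtin */
}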

phkahler
  • The adding method fails when the values we are checking are numeric. (ex. if we are checking that more than x out of n ints evaluate to false, but perhaps one of them is equal to 2 which throws off the math) – Swift Jul 14 '11 at 15:15
  • 1
    @Mike: You can use `y != 0` to collapse nonzero `y` values to `1`. – hammar Jul 14 '11 at 15:18
  • Would be faster to count the appropriate values to `min(X,n-X)` – davin Jul 14 '11 at 15:42
4

In programming languages where boolean values are simply a signed char with a value of 1 or 0, we could just do this:

if (input1 + input2 + input3 + input4 + ... + inputN >= X)
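
As the comment below notes, this breaks down if the inputs are arbitrary numeric values rather than strict 0/1 booleans; a sketch of the same idea (hypothetical names) that first normalizes each value with != 0, which in C always yields 0 or 1:

#include <stdbool.h>

/* Sketch: normalizing with != 0 keeps the sum correct even when the
   inputs are arbitrary numbers (e.g. 2 still counts as one "true"). */
static bool at_least_x_sum(int input1, int input2, int input3, int input4, int x)
{
    return (input1 != 0) + (input2 != 0) + (input3 != 0) + (input4 != 0) >= x;
}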
  • 1
    This falls apart when you're dealing with numeric inputs and falsey values, otherwise I think this would be my favorite. – Swift Jul 15 '11 at 13:54
3

There can be no sublinear algorithm for the general case: you have to look at each input. So counting is just fine.

Average-case runtime depends on your average cases. Chris's idea to stop counting as soon as the count determines the outcome will help in many cases.

Beyond that, it really comes down to appropriate data structures. You asked about bitfields: http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive

ian
2

The answer given by mjv covered sequential algorithms pretty thoroughly, but because of your reference to circuit logic I feel that parallel and physical algorithms need to be lightly addressed as well.

If you had N inputs to a digital circuit (or cell processor), you could count them in O(log n) time by recursively adding pairs and propagating the result:

[1, 1, 0, 1, 1, 1, 0, 1] -> [(1+1), (0+1), (1+1), (0+1)] -> [(2+1), (2+1)] -> [(3+3)]

This still takes about N additions in total, but they are parallelizable into log2(N) generations (assuming you have enough processors/adders to run N/2 operations simultaneously).

There are variations on this algorithm to take advantage of the thresholding problem requirement, but they mostly won't be worthwhile unless the threshold is expected to be very low (10 out of 14000 inputs, for example).
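
In software terms, the same pairwise reduction looks roughly like the sketch below; every pass over the array is one "generation", and the additions within a pass are independent, so with enough adders they could all run at once (sequentially it is still about N additions):

/* Sketch of a pairwise (tree) reduction over values holding 0 or 1.
   Each pass halves the number of partial sums; log2(N) passes in total.
   Note: this modifies the array in place. */
static int count_true_tree(int values[], int n)
{
    while (n > 1) {
        int half = n / 2;
        for (int i = 0; i < half; i++)
            values[i] = values[2 * i] + values[2 * i + 1];   /* one "generation" */
        if (n % 2) {                      /* carry the unpaired element forward */
            values[half] = values[n - 1];
            n = half + 1;
        } else {
            n = half;
        }
    }
    return (n == 1) ? values[0] : 0;
}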

nevinera
  • I like this method, but I'll make the same point I made before. Adding the values falls apart when you're dealing with numeric data and falsey values. – Swift Jul 15 '11 at 13:53
  • Well then, just convert true values to 1 and false values to 0 as part of the first round of additions. – Clueless Jul 16 '11 at 22:22
0
import Data.List (foldl')

-- Count how many elements of the list satisfy the predicate.
xoutofn :: Num a => (a1 -> Bool) -> [a1] -> a
xoutofn pred ns = foldl' test 0 (map pred ns) where
                    test x True  = x + 1
                    test x False = x

main :: IO ()
main = print $ xoutofn predicate [1 .. 1000] where
    predicate x = x > 500

I could do the short-circuiting that others above do, stopping as soon as it finds X, but I like the simplicity of this version.

The running time is O(n) because converting each item in the list to a boolean happens as needed (because of laziness), so it only needs to traverse the list once.

$ time ./xoutofn
500

real    0m0.003s
user    0m0.000s
sys     0m0.000s
Wes