17

This question is basically a follow-on from my answer here. I really wanted to state what the Big-O of this algorithm would be, but I wasn't sure my claim was completely sound.

So given two arrays:

B = [ "Hello World!", "Hello Stack Overflow!", "Foo Bar!", "Food is nice...", "Hej" ]
A = [ "World", "Foo" ]

what is the Big O of:

List<string> results = new List<string>();
foreach (string test in B)
{
   if (A.Any(a => test.Contains(a)))
      results.Add(test);
}

I believe it to be somewhere between O(n) and O(n^2), as it depends on where in A the Any() finds a match...

Liam
  • That's what I *believe* it to be, but I'm really not 100% sure. – Liam Jul 12 '16 at 15:19
  • I'm pretty sure it's still considered an O(n^2) algorithm even if it can break early. – itsme86 Jul 12 '16 at 15:22
  • I guess the question here is what is `n` in this case, the number of items in `A` or in `B`? If it is just one of those 2 the complexity would be `O(n)` – heijp06 Jul 12 '16 at 15:22
  • @heijp06 It's both of them since the algorithm iterates over both. Hence O(n^2) – itsme86 Jul 12 '16 at 15:24
  • @itsme86 Wouldn't that only be the case if the number of items in `A` depended linearly on the number of items in `B` (or vice versa)? – heijp06 Jul 12 '16 at 15:26
  • @heijp06 Big-O notation doesn't take into consideration individual data sets. It's a high-level efficiency evaluation of an algorithm. Nested iterations without optimizations (e.g. binary search) are O(n^2). – itsme86 Jul 12 '16 at 15:28
  • Since the input is constant, so is the algorithmic complexity – John Dvorak Jul 12 '16 at 18:47
  • It's not really related to the question, but I feel the need to point out that this could be written as `List<string> results = B.Where(b => A.Any(b.Contains)).ToList();`. Also, if there's a chance the results might not be needed, it's better to store an enumerable than to write the results to a list, so you're not wasting time calculating results that are just going to be discarded. – Pharap Jul 13 '16 at 00:06
  • @itsme86 The complexity here is `O(mn)` with `m` the length of `B` and `n` the length of `A`. It can only be `O(n^2)` if `m = cn` for sufficiently large `n`, with `c` some constant independent of `m` and `n`; then `O(mn) = O(cn^2) = O(n^2)`. But there is nothing in the problem statement that indicates that that simplification can be made. – heijp06 Jul 13 '16 at 06:22
  • **Hint**: What's n? Your question is ill-defined. The time complexity of your algorithm depends on several variables: the lengths of the two arrays, the lengths of the strings in them. Defining the problem clearly will help you reason about it. – Colonel Panic Jul 13 '16 at 09:58

5 Answers

23

Let the length of A be N and the length of B be M. We have two extreme cases:

  1. The worst one:

    a => test.Contains(a) 
    

    returns false for every a, so A.Any has to scan all of A, and we have

    O(N * M) 
    
  2. The best one:

    a => test.Contains(a) 
    

    returns true on the very first item of A, so A.Any returns immediately, and we have only

    O(M)
    

The actual complexity is somewhere in between (both bounds included):

    [O(M)..O(M*N)]
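
Below is a minimal sketch (my own illustration, not part of the original answer; the array contents are made up) that counts how many times the Contains predicate actually runs, showing both bounds:

using System;
using System.Collections.Generic;
using System.Linq;

class AnyComplexityDemo
{
    // Counts predicate invocations for the snippet from the question.
    static int CountPredicateCalls(string[] B, string[] A)
    {
        int calls = 0;
        var results = new List<string>();
        foreach (string test in B)
        {
            // Any() stops at the first match, so per outer iteration the
            // predicate runs between 1 (best) and A.Length (worst) times.
            if (A.Any(a => { calls++; return test.Contains(a); }))
                results.Add(test);
        }
        return calls;
    }

    static void Main()
    {
        var A = new[] { "World", "Foo" };

        // Worst case: nothing in B matches, so every iteration scans all of A.
        var worst = new[] { "abc", "def", "ghi" };
        Console.WriteLine(CountPredicateCalls(worst, A)); // M * N = 3 * 2 = 6

        // Best case: everything in B matches the first item of A.
        var best = new[] { "World 1", "World 2", "World 3" };
        Console.WriteLine(CountPredicateCalls(best, A)); // M = 3
    }
}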
Dmitry Bychenko
  • To be clear: best case complexity is O(M), worst case complexity is O(M*N) and average case complexity is O(M*(N/2)) = O(M*N) – Jasmijn Jul 12 '16 at 16:04
  • @Robin: When talking about the *average case* we must know the *distribution* (i.e. the average over what); the question doesn't provide any hint of it. You've solved the case where each item of `A` is somewhere in `B` (all items of `B` are distinct) and the distribution of `A` items in `B` is uniform – Dmitry Bychenko Jul 12 '16 at 16:12
  • That's fair enough. – Jasmijn Jul 12 '16 at 16:20
  • The runtime of `.Contains` is not trivial, and it matters to the question. It should be mentioned in this answer. – Adam Martin Jul 12 '16 at 21:41
  • Echoing @AdamMartin here. You've treated `Contains()` as if it's always an O(n) operation and that is only its best case. OP had the answer right to begin with. – RubberDuck Jul 13 '16 at 01:04
11

It's close; however, as others mention, it's going to be O(n*m) because the sizes of your two collections differ (with the best case being O(m) and the worst being O(n*m)). If you were iterating a collection of the same size twice, you'd get O(n^2).

Taking a Look Behind the Scenes at Any()

You can take a look at the source for the Enumerable.Any() method to dig into this a bit further. You'll see a foreach loop that iterates through the source until it finds a match for the predicate, which reinforces your assumption of it being O(n):

public static bool Any<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) {
   if (source == null) throw Error.ArgumentNull("source");
   if (predicate == null) throw Error.ArgumentNull("predicate");
   // Another loop to iterate through the collection until the predicate is matched
   foreach (TSource element in source) {
        if (predicate(element)) return true;
   }
   return false;
}

As you can see, Any() by itself is going to be O(n) in the length of its source, and nesting it inside an existing O(m) loop gives O(n*m) for the entire code. However, because Any() exits early on a match, it could be as low as O(m).
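
To make the nesting explicit, here is the question's snippet with Any() hand-inlined as a plain loop (my own expansion for illustration, not compiler output); the inner loop is exactly what Any() does internally:

using System.Collections.Generic;

string[] B = { "Hello World!", "Hello Stack Overflow!", "Foo Bar!", "Food is nice...", "Hej" };
string[] A = { "World", "Foo" };

List<string> results = new List<string>();
foreach (string test in B)        // runs m times (length of B)
{
    bool matched = false;
    foreach (string a in A)       // runs up to n times (length of A)
    {
        if (test.Contains(a))
        {
            matched = true;
            break;                // early exit, same as Any() returning true
        }
    }
    if (matched)
        results.Add(test);
}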

Rion Williams
5

.Any() should be O(n), as it searches the collection until it finds the first matching instance. Nested inside a foreach loop, that gives O(n^2) in the worst case (or, more precisely, O(n*m) when the two collections have different lengths).

S.Ahl
5

I assume you provide A and B just as examples of their contents, and you are aware that complexity is meaningful only for sets of inputs (as in average, worst, and best cases), not for individual inputs.

My point is, depending on the problem requirements and use case, you can make very different estimates for the code complexity.

Let n be A.Length and m be B.Length. The complexity of the given code can then be estimated in a few different ways:

  1. Assume string.Contains is O(1). In practice, such a strong assumption can be made, for example, if we know for sure that no string is longer than some length fixed in advance. Then the code's complexity is of course O(n*m).

  2. Assume string.Contains is O(x*y), where x and y are the lengths of the haystack and the needle. Let the longest string in A have length k_A and the longest string in B have length k_B. Then the code's complexity would be O(n*m*k_A*k_B).

For the second case, there is another (more practical) approach:

Let S_A be the sum of lengths of all strings in A, and S_B be the sum of lengths of all strings in B. Then the code's complexity would be O(S_A * S_B).

That covers the worst case. For the average case, though, and for most practical cases, string.Contains is O(x + y), so the average code complexity would be O(n*m*(k_A + k_B)) or O((S_B + k_A) * n + (S_A + k_B) * m).
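
To see where the O(x*y) worst case for a substring search comes from, here is a counting version of a naive scan (an illustrative stand-in written for this edit; the real string.Contains implementation in the BCL differs and is typically faster):

using System;

class NaiveContainsDemo
{
    // Naive substring search that counts character comparisons.
    // x = haystack.Length, y = needle.Length; worst case is ~x*y comparisons.
    static bool NaiveContains(string haystack, string needle, out int comparisons)
    {
        comparisons = 0;
        for (int i = 0; i + needle.Length <= haystack.Length; i++)
        {
            int j = 0;
            while (j < needle.Length)
            {
                comparisons++;
                if (haystack[i + j] != needle[j]) break;
                j++;
            }
            if (j == needle.Length) return true;
        }
        return false;
    }

    static void Main()
    {
        // Adversarial input: a long shared prefix forces ~x*y comparisons.
        NaiveContains(new string('a', 1000), new string('a', 99) + "b", out int worst);
        Console.WriteLine(worst); // 90100: every window is checked almost fully

        // Typical input: mismatches occur early, so the cost is closer to O(x + y).
        NaiveContains("Hello World!", "World", out int typical);
        Console.WriteLine(typical); // 11
    }
}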

0

It's essentially a nested for loop, so the Big-O should be O(n^2) in the worst-case scenario.