1

I want to understand the difference in time complexity between these two solutions. The task is not relevant but if you're curious here's the link with the explanation.

This is my first solution. Scores a 100% in correctness but 0% in performance:

def solution(s, p ,q):
    dna = dict({'A': 1, 'C': 2, 'G': 3, 'T': 4})
    result = []
    
    for i in range(len(q)):
        least = 4
        
        for c in set(s[p[i] : q[i] + 1]):
            if least > dna[c]: least = dna[c]
        
        result.append(least)
            
    return result

This is the second solution. Scores 100% in both correctness and performance:

def solution(s, p ,q):
    result = []
    
    for i in range(len(q)):
        if 'A' in s[p[i]:q[i] + 1]: result.append(1)
        elif 'C' in s[p[i]:q[i] + 1]: result.append(2)
        elif 'G' in s[p[i]:q[i] + 1]: result.append(3)
        else: result.append(4)
        
    return list(result)

Now this is how I see it. In both solutions I'm iterating through a range of Q length and on each iteration I'm slicing different portions of a string, with a length between 1 and 100,000.

Here's where I get confused, in my first solution on each iteration, I'm slicing once a portion of the string and create a set to remove all the duplicates. The set can have a length between 1 and 4, so iterating through it must be very quick. What I notice is that I iterate through it only once, on each iteration.

In the second solution on each iteration, I'm slicing three times a portion of the string and iterate through it, in the worst case three times with a length of 100,000.

Then why is the second solution faster? How can the first have a time complexity of O(n*m) and the second O(n+m)?

I thought it's because of the in and the for operators, but I tried the same second solution in JavaScript with the indexOf method and it still gets a 100% in performance. But why? I can understand that if in Python the in and the for operators have different implementations and work differently behind the scene, but in JS the indexOf method is just going to apply a for loop. Then isn't it the same as just doing the for loop directly inside my function? Shouldn't that be a O(n*m) time complexity?

shidoro
  • 362
  • 4
  • 9
  • just a quick comment _"The set can have a length between 1 and 4, so iterating through it must be very quick"_ sure thing, but what about the time required to _create_ the set? – Pac0 Oct 21 '20 at 10:42
  • I know it needs to take time to create it, but it only has to do it once per iteration. Isn't it much slower to iterate through a whole piece of string three times? – shidoro Oct 21 '20 at 10:44
  • In my opinion both solutions are O(q * n). By storing the indexes of As, Cs and Gs in an array you could do bisection to find the index x >= P[i] and if Q[i] >= x the corresponding array is the minimal factor, otherwise continue with next array of indexes. This would be O(n + q * logn). – maraca Oct 21 '20 at 11:46
  • I understand the main idea but I'm not familiar with "performing a bisection".. I'm kind of a rookie and I'm still learning :) I tried to do iterate over the string instead of Q and I did what you're suggesting with x >= P[i] and x <= Q[i].. the problem is that sometimes the length of the string is shorter than the length of Q and that gives me the wrong result.. – shidoro Oct 21 '20 at 13:49

2 Answers2

1

I think the difference in complexity can easily be showcased on an example.

Consider the following input:

s = 'ACGT' * 1000000
# = 'ACGTACGTACGTACGTACGTACGTACGTACGTACGT...ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT'
p = [0]
q = [3999999]

Algorithm 2 very quickly checks that 'A' is in s[0:4000000] (it's the first character - no need to iterate through the whole string to find it!).

Algorithm 1, on the other hand, must iterate through the whole string s[0:4000000] to build the set {'A','C','G','T'}, because iterating through the whole string is the only way to check that there isn't a fifth distinct character hidden somewhere in the string.

Important note

I said algorithm 2 should be fast on this example, because the test 'A' in ... doesn't need to iterate through the whole string to find 'A' if 'A' is at the beginning of the string. However, note a possible important difference in complexity between 'A' in s and 'A' in s[0:4000000]. The problem is that creating a slice of the string might cost time (and memory) if it's copying the string. Instead of slicing, you should use s.find('A', 0, 4000000), which is guaranteed not to build a copy. For more information on this:

Stef
  • 13,242
  • 2
  • 17
  • 28
  • It just explains the difference in performance. However it is not a difference in complexity. – maraca Oct 21 '20 at 12:17
  • @maraca: Thanks for the distinction. It's true that there is no notable difference in worst-case complexity between the two algorithms. There will be a difference in complexity in many cases, though, especially since there is always at most 4 distinct characters. – Stef Oct 21 '20 at 12:19
  • What happens then if the string is `'T' * 1000000`? Would in this case scenario the first solution be faster? Or is the function implementation in C of the `in` operator still faster? You're right about the `find` I didn't think about it. – shidoro Oct 21 '20 at 13:30
  • @speed-o-soundSonic I'm guessing the second solution would almost always be faster, although in the case of `'T' * 1000000` the difference should be marginal. The only way to know for sure is to time it! – Stef Oct 21 '20 at 15:34
  • So it turned out creating the set on each iteration was actually slowing the process down big times. Searching with the `in` operator is indeed a lot faster, even compared to a regex it performs on average 2x as fast and 10x faster for extreme cases. – shidoro Oct 21 '20 at 17:09
1

You haven't specified how the performance rating is obtained, but anyway, the second algorithm is clearly better, mainly because it uses the in operator, that under the hood calls a function implemented in C, which is far more efficient than python. More on this topic here. Also, I'm not sure, but I don't think that the python interpreter isn't smart enough to slice the string only once and then reuse the same portion the other times in the second algorithm. Creating the set in the first algorithm also seems like a very costly operation.

Lastly, maybe the performance ratings aren't based on the algorithm complexity, but rather on the execution time over different test strings?

  • Unfortunately I don't know how Codility rates performances. [First solution](https://app.codility.com/demo/results/trainingWTDPTX-4ZY/), [Second solution](https://app.codility.com/demo/results/trainingWKV2QX-MMJ/). Here are the results of the tests. It says that my first solution kills the time limit of 6s whereas the other completes in less than half a second. It's a very huge difference.. I would have never imagined it was that costly to create a set.. if that's what's actually slowing down the function – shidoro Oct 21 '20 at 13:40
  • It's probably a combination of factors: creating a set, iterate always the whole array, and not using the C implementation of in. – Lorenzo Felletti Oct 21 '20 at 13:44
  • That does solve the Python case.. but I'm still confused to why it works then for JavaScript [solution](https://app.codility.com/demo/results/training7GDGSK-JT9/) which is exactly the same as in Python. Unless the `indexOf` method also calls a function implemented in C which I'm not aware of.. isn't that just as creating another for loop? (Sorry I know the post is actually set for Python and not JS you don't have to answer if you don't want to) – shidoro Oct 21 '20 at 13:55
  • Yeah it could be the extra for loop... Have you tried porting also the first solution to JS? Does they still get a different result? – Lorenzo Felletti Oct 21 '20 at 14:07
  • Yes I tried it now [here](https://app.codility.com/demo/results/trainingNF6PYA-U3W/). So creating a set also in JS results in a much slower computation.. thing is the `indexOf` method does not call any underlying function implemented in C.. I believe? – shidoro Oct 21 '20 at 15:01
  • So it is the set the main factor affecting performances. No, JS indexOf hasn't got an underlying C implementation. – Lorenzo Felletti Oct 21 '20 at 15:08
  • 1
    I tried [this](https://app.codility.com/demo/results/trainingFV4GAZ-XNA/). I implemented a `myIndexOf` method based on the polyfill of the array indexOf (I couldn't find the polyfill of the string indexOf method so I assumed they practically did the same thing), it didn't work. So I tried again [here](https://app.codility.com/demo/results/trainingGFATA3-9EU/), using regex, and it worked. So that's probably where I was wrong. I thought that the string indexOf method performed a for loop but now I believe it uses regex under the hood. That's why the search was much faster. – shidoro Oct 21 '20 at 16:45
  • So yes you were right.. the costly thing was creating the set or the object and not actually searching through it. – shidoro Oct 21 '20 at 16:46