
I am curious as to which implementation would be faster at determining the one element two lists do not have in common. Here, both lists are the same length and differ in exactly one element.

Implementation #1 :

lista = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
listb = ['a', 'b', 'c', 'd', 'e', 'f', 'gslfkjsjf']
difference = list(set(lista) - set(listb))
>>> ['g']

Implementation #2 :

lista = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
listb = ['a', 'b', 'c', 'd', 'e', 'f', 'gslfkjsjf']
for i in range(len(lista)):
    if (lista[i] != listb[i]):
        print(lista[i])
>>> g

I am interested in knowing the answer because I am trying to find the fastest way to compare two lists of the same length (around 2000 elements or so, where each element is a unique string; I only used single characters for the sake of the example). Thank you in advance to all who reply.

Kyle
  • Why is this a duplicate? There are many solutions on the internet that give two different answers. – Kyle Apr 16 '19 at 01:37

2 Answers


Here is how to measure it from a live ipython3 REPL:

from timeit import timeit  # import timeit

# declare the lists
lista = ['a', 'b', 'c', 'd', 'e', 'f', 'g']        
listb = ['a', 'b', 'c', 'd', 'e', 'f', 'gslfkjsjf']

# measure
timeit('difference = list(set(lista) - set(listb))', globals=globals())

timeit('''for i in range(len(lista)):         
    if (lista[i] != listb[i]):                
        print(lista[i])''', globals=globals())

Here the results were 4.551160663000701 for the loop and 0.851781547000428 for the sets. Remember that timeit runs the statement 1,000,000 times by default.
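For larger lists the default 1,000,000 repetitions can be slow; `timeit` accepts a `number` argument to control this. A small sketch with made-up 2000-element data (not the lists from the question):

```python
from timeit import timeit

# 2000 unique strings, same length, differing in exactly one element
lista = [str(i) for i in range(2000)]
listb = lista[:-1] + ['different']

# number= controls how many times the statement runs (default is 1,000,000)
t = timeit('list(set(lista) - set(listb))', globals=globals(), number=1000)
print(t)  # total seconds for 1000 runs; divide by 1000 for per-run time
```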

Now, why are the sets so much faster? Sets use hashing instead of indexing, which means lookup is much faster because a set does not need to be iterated to find a value. Also, the for loop has a print, a range, and a comparison; it isn't only slower per operation, it is doing much more work too.
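The hash-vs-scan difference can be seen directly by timing a membership test against a list and against a set built from the same data (a minimal sketch; the absolute numbers will vary by machine):

```python
from timeit import timeit

data = [str(i) for i in range(2000)]  # 2000 unique strings, as in the question
as_list = list(data)
as_set = set(data)

# '1999' is the last element, so the list must scan all 2000 entries,
# while the set finds it with a single average-O(1) hash lookup
t_list = timeit('"1999" in as_list', globals=globals(), number=10_000)
t_set = timeit('"1999" in as_set', globals=globals(), number=10_000)
print(t_list > t_set)  # the list scan takes noticeably longer
```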

geckos
  • The print might be taking some time in implementation 2. – St1id3r Apr 16 '19 at 01:27
  • Yeah, but usually stdout is buffered, so the print won't block; it all depends on the output device, of course. I said this at _Also the for loop has a print, ..._ – geckos Apr 16 '19 at 01:30
  • I am not accounting for the print in the running time. I am testing it on my end and it seems implementation 2 is faster? – Kyle Apr 16 '19 at 01:31
  • @otterdog Did you try to run it multiple times with different dataset sizes? One test cannot really tell you about speed. – St1id3r Apr 16 '19 at 01:32
  • Yeah, without the print the time drops to 0.8392677060001006, which is faster – geckos Apr 16 '19 at 01:33
  • Anybody good with `dis`? – geckos Apr 16 '19 at 01:33
  • @St1id3r I am running it with 1 million strings in the list, and averaging the result of 10 trials. Still seems as if implementation 2 is faster? – Kyle Apr 16 '19 at 01:35
  • Can you post the numbers and related code? – St1id3r Apr 16 '19 at 01:39

As per the docs at https://wiki.python.org/moin/TimeComplexity , the set difference s-t takes O(len(s)) on average. Also see https://stackoverflow.com/a/48044412/3236440 .

So, your Implementation #1 takes O(len(lista)), and your Implementation #2 takes O(len(lista)) as well, since it runs over all the elements of lista.
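Since both lists are the same length, the linear pairwise scan of Implementation #2 can also be written with `zip`, which avoids the `range`/indexing overhead (a sketch, not from the original question):

```python
lista = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
listb = ['a', 'b', 'c', 'd', 'e', 'f', 'gslfkjsjf']

# O(n) pairwise scan; zip pairs elements by position
mismatches = [a for a, b in zip(lista, listb) if a != b]
print(mismatches)  # ['g']
```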

For 2000 elements the two should perform about the same, as both lists easily fit in main memory. Also, you mention that each element is unique, so hash collisions in the set will be rare.

Another key point here is to always put the smaller set on the left of the set difference, since s - t takes time proportional to len(s).
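One caveat worth noting: `set(lista) - set(listb)` only reports elements of `lista` that are missing from `listb`. If the differing element could be on either side, `symmetric_difference` covers both directions (a sketch, not from the original answer):

```python
lista = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
listb = ['a', 'b', 'c', 'd', 'e', 'f', 'gslfkjsjf']

# elements that appear in exactly one of the two lists
diff = set(lista).symmetric_difference(listb)
print(sorted(diff))  # ['g', 'gslfkjsjf']
```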

St1id3r