3

In order to find the subtraction (difference) of two lists in Python I use:

names_of_files_not_dowloaded = [item for item in total_files if item not in names_of_files_downloaded]

It works.

The sizes of the lists are:

  • total number of files: 56,373 elements
  • list of files downloaded: 28,464 elements

It takes 34 seconds. Somehow I have the intuition that 34 seconds is too long. Is there any way to do this subtraction more efficiently?

Thanks.

EDIT: An element is a string like 'AB12345'.

The lists do not have any repeated elements; they are effectively already sets.

JFerro
  • Initialise `files_downloaded2 = set(files_downloaded)` and replace `files_downloaded` in the list comp with `files_downloaded2`. – cs95 May 21 '19 at 22:45
  • How large are these files? You could be doing a lot of comparisons if they're of any substantial size. Have you considered hashing the files before doing the comprehension? – Andrew May 21 '19 at 22:46

3 Answers

4

Make files_downloaded a set instead of a list. A membership check on a list may require iterating over the entire list, and that cost is paid every single time you do a check. Sets, however, are much more efficient to do lookups on.

Just use:

downloaded_set = set(files_downloaded)
list_of_files_not_dowloaded = [item for item in total_files if item not in downloaded_set]

This has an up-front cost to build the set from the list, but every membership check afterward will be much faster.
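If you want to see the gap yourself, here is a minimal timeit sketch (the sizes and the 'AB…' strings are made up for illustration, not the OP's data):

import timeit

setup = "data = ['AB' + str(i) for i in range(50000)]; lookup = set(data)"
print(timeit.timeit("'AB49999' in data", setup=setup, number=100))    # list: linear scan every time
print(timeit.timeit("'AB49999' in lookup", setup=setup, number=100))  # set: constant-time hash lookup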


@juanpa.arrivillaga also mentioned in the comments that another cause of the performance hit is the equality checks on the strings; when you use sets, hashes are compared instead, which is much cheaper.

It seems like, if I'm reading the source right, CPython's lists use a straight equality check when testing for membership. Presumably, sets use hashes, and those are cached at the time of set creation.
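One way to see this from pure Python is to count the calls on a toy class with instrumented __eq__ and __hash__ (hypothetical code, purely illustrative; real strings do this in C and also cache their hash):

class Probe:
    eq_calls = 0
    hash_calls = 0
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        Probe.eq_calls += 1
        return isinstance(other, Probe) and self.value == other.value
    def __hash__(self):
        Probe.hash_calls += 1
        return hash(self.value)

items = [Probe(i) for i in range(1000)]
needle = Probe(999)                       # worst case: matches the last list element

Probe.eq_calls = 0
needle in items
print(Probe.eq_calls)                     # ~1000: the list compared == element by element

as_set = set(items)                       # every element gets hashed once, here
Probe.eq_calls = Probe.hash_calls = 0
needle in as_set
print(Probe.hash_calls, Probe.eq_calls)   # 1 hash and ~1 == check for the set lookup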

Carcigenicate
  • Astonishing, @Carcigenicate, the difference is huge: without the set, 34 seconds; with the set, 0.022412776947021484 seconds! – JFerro May 21 '19 at 22:53
  • Yes. O(n) vs O(1) complexity is massive. It really shows with large data sets. With a change that large though, make sure the output is correct and that something isn't going horribly wrong. – Carcigenicate May 21 '19 at 22:54
  • @Berlines Try the other answers too though. I wouldn't be surprised if set subtraction is delegated to optimized C and is even faster than what I have here. – Carcigenicate May 21 '19 at 23:05
  • @Carcigenicate actually, pretty sure it's O(M*N) where M and N are the size of the two lists. Furthermore, really here we are also dealing with the expensiveness of string comparisons, but that gets optimized to a hash when you compare the sets. – juanpa.arrivillaga May 21 '19 at 23:20
  • @juanpa.arrivillaga Nevermind, realized you were referring to my first comment. – Carcigenicate May 21 '19 at 23:26
  • @Carcigenicate no, for the lists. But I believe I misunderstood you, you were referencing the O(N) lookup on lists vs O(1) lookup on sets, but using sets, the full difference algorithm is O(min(M, N)) vs O(M*N) using lists I believe – juanpa.arrivillaga May 21 '19 at 23:26
  • @juanpa.arrivillaga Wow. That's quite the rabbit-hole you sent me down. I realized I didn't know fully how `in` works on lists, so looked over the source. Yes, as linked above, it does appear to use an equality check instead of hashes; at least in the CPython implementation. – Carcigenicate May 21 '19 at 23:55
  • @Carcigenicate actually, there's a weird optimization: it actually checks object identity first, on the assumption that two identical objects should always be equal; checking identity is always cheap vs checking equality (which can be very expensive). This can lead to weird behavior sometimes, though. – juanpa.arrivillaga May 22 '19 at 00:14
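A concrete illustration of that identity shortcut, using NaN (the one common value that is unequal to itself):

nan = float('nan')
print(nan == nan)             # False: NaN never compares equal, even to itself
print(nan in [nan])           # True: `in` checks identity first and short-circuits
print(float('nan') in [nan])  # False: a different NaN object falls back to the == check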
3

If you don't care about the order of elements and your lists don't contain duplicates, you can simply use:

diff = set(total_files) - set(files_downloaded)

If you need the output as a list:

diff = list(set(total_files) - set(files_downloaded))

set overrides the __sub__() method, so the - operator performs a set difference, which is exactly what you are looking for.
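For example, with made-up IDs shaped like the ones in the question:

total = {'AB12345', 'AB12346', 'AB12347'}
done = {'AB12346'}
print(total - done)            # {'AB12345', 'AB12347'} (order is not preserved)
print(total.difference(done))  # same result

One small convenience: the - operator requires both operands to be sets, while .difference() accepts any iterable, so set(total_files).difference(files_downloaded) saves one conversion.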

As your question says that the lists do not contain dupes and behave like sets, this should get you what you want with relatively good performance.

Andrew
1
total_files_set = set(total_files)
files_downloaded_set = set(files_downloaded)
files_not_dowloaded_set = total_files_set - files_downloaded_set 
list_of_files_not_dowloaded = list(files_not_dowloaded_set)

Or, if you want it in one line:

list_of_files_not_dowloaded = list(set(total_files) - set(files_downloaded))

To learn more about set operations, see the Python documentation on sets.
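As a quick reference, the main operators, shown with made-up elements:

a = {'AB1', 'AB2', 'AB3'}
b = {'AB2', 'AB4'}
print(a | b)  # union: {'AB1', 'AB2', 'AB3', 'AB4'}
print(a & b)  # intersection: {'AB2'}
print(a - b)  # difference: {'AB1', 'AB3'}
print(a ^ b)  # symmetric difference: {'AB1', 'AB3', 'AB4'}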

EDIT: I timed both methods using two random lists.

  • For a subset with 50,000 elements and a superset with 100,000 elements:
timeit.timeit('l = list(set(l1)-set(l2))', 
setup='import random; l1 = random.sample(range(1000000), 100000); l2 = random.sample(range(1000000), 50000)', 
number = 10)

Output:

0.39393879500130424

For the list comprehension, the sample lists are ten times smaller (10,000 and 5,000 elements) and it is run only once, because it is so much slower:

timeit.timeit('l = [item for item in l1 if item not in l2]',
setup='import random; l1 = random.sample(range(1000000), 10000); l2 = random.sample(range(1000000), 5000)',
number = 1)

Output:

98.58012624000003

If you happen to already have both sets, instead of having to convert from lists:

timeit.timeit('l = list(s2-s1)', 
setup='import random; s1 = set(random.sample(range(1000000), 100000)); s2 = set(random.sample(range(1000000), 50000))', 
number = 10)

Output:

0.06160322100004123

victortv