
I have a list of 8 million unique integers representing IDs of, let's say, books. The thing is, this list changes every semester (erased IDs, new IDs). Using only list comprehensions to get a new list of "new IDs" or "erased IDs" takes forever.

I tried these two operations to look for the two previously described items (erased IDs, new IDs):

l1 = [1,2,3,4,5]
l2 = [0,2,3,4,6,7]

new_ids = [x for x in l2 if x not in l1]
erased_ids = [x for x in l1 if x not in l2]

Is there a parallel way to process these comparisons to get better performance?

  • 5
    Try representing the lists as sets, then use set operations on them – cs95 Apr 23 '19 at 21:00
  • 3
    Why aren't you using a database for this? – chepner Apr 23 '19 at 21:00
  • Possible duplicate of [Fastest way to check if a value exist in a list](https://stackoverflow.com/questions/7571635/fastest-way-to-check-if-a-value-exist-in-a-list) – srknzl Apr 23 '19 at 21:07

1 Answer

3

You could try to do this with multiprocessing, but that's not going to help you much, as it will only cut the time to compute the answer in half. You said it takes forever, and forever / 2 is still forever. You need a different algorithm. Try sets:

set1 = set(l1)
set2 = set(l2)

new_ids = list(set2 - set1)
erased_ids = list(set1 - set2)
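
With the example lists above, new_ids comes out as [0, 6, 7] and erased_ids as [1, 5] (in arbitrary order, since sets are unordered).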

Your algorithm runs in O(n^2). This is because [x for x in l2 if x not in l1] scans all of l1 for every x in l2. If l1 and l2 each have 8 million elements, that requires 8,000,000^2 = 64,000,000,000,000 checks.

Instead, a set is a data structure (which uses hashing internally) that can check for element membership in a single operation, or O(1) on average. In theory, checking whether a number x is in a set takes the same amount of time whether the set has one element or 8 million. This is not true for a list.
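
For illustration, here is a small timing sketch (not part of the original answer; the size and probed value are arbitrary) comparing membership checks on a list versus a set:

import timeit

# Arbitrary illustrative size; the real lists have ~8 million IDs.
n = 1_000_000
big_list = list(range(n))
big_set = set(big_list)

# Worst case for the list: the probed value sits at the very end.
list_time = timeit.timeit(lambda: n - 1 in big_list, number=100)
set_time = timeit.timeit(lambda: n - 1 in big_set, number=100)

print(f"list membership: {list_time:.4f}s for 100 checks")
print(f"set membership:  {set_time:.6f}s for 100 checks")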

Sets can also be subtracted. set2 - set1 means "the elements in set2 that aren't in set1". This is done in O(n) time, essentially by doing n O(1) membership checks.
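
To make the subtraction concrete, here is a small sketch (using the question's example values) showing that set2 - set1 matches a membership-filtering comprehension, just with O(1) lookups:

set1 = {1, 2, 3, 4, 5}
set2 = {0, 2, 3, 4, 6, 7}

# Set difference: elements of set2 that are not in set1.
diff = set2 - set1                          # {0, 6, 7}

# Equivalent comprehension; each `x not in set1` check is O(1) on average.
same = {x for x in set2 if x not in set1}

assert diff == same
print(diff)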

Building the sets in the first place is also O(n) time, as adding an element to a set is an O(1) operation and you must do it for each of the n elements.

Therefore, this whole algorithm runs in O(n) time.
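
Putting it all together, here is a hedged end-to-end sketch; the 8-million-element inputs are synthetic, generated only to show that the set approach finishes in seconds rather than "forever":

import random
import time

# Synthetic stand-ins for the real ID lists (assumed to be plain integers).
old_ids = random.sample(range(20_000_000), 8_000_000)
current_ids = random.sample(range(20_000_000), 8_000_000)

start = time.perf_counter()

old_set = set(old_ids)          # O(n) to build
current_set = set(current_ids)  # O(n) to build

new_ids = list(current_set - old_set)      # in current but not in old
erased_ids = list(old_set - current_set)   # in old but not in current

print(f"{len(new_ids)} new, {len(erased_ids)} erased "
      f"in {time.perf_counter() - start:.2f}s")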

Tom Lubenow
  • Can you explain how sets find these internally? I mean, what is the algorithm/process internally and its time complexity? – vb_rises Apr 23 '19 at 21:05
  • 1
    sets use hashing internally to obtain linear runtimes for most normal operations. I edited my answer, let me know if you still have questions. – Tom Lubenow Apr 23 '19 at 21:16
  • @TomLubenow can't believe how fast that was. Thanks!! – Claudio Paredes Apr 23 '19 at 21:24
  • But you can do this in O(n) time without building sets if the lists are sorted. This is just one of the forms of a merge join. In this case it can extract both the left and right outer sides in one pass (see the sketch below). – Dan D. Apr 23 '19 at 21:38
  • @TomLubenow Thanks for the explanation. It's clear now. – vb_rises Apr 23 '19 at 21:52
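
For completeness, here is a sketch of the merge-join idea from Dan D.'s comment; it is not part of the accepted answer and assumes both lists are already sorted and duplicate-free:

def diff_sorted(old_list, new_list):
    """One pass over two sorted, duplicate-free lists.
    Returns (new_ids, erased_ids)."""
    new_ids, erased_ids = [], []
    i = j = 0
    while i < len(old_list) and j < len(new_list):
        if old_list[i] == new_list[j]:
            i += 1
            j += 1
        elif old_list[i] < new_list[j]:
            erased_ids.append(old_list[i])   # present only in the old list
            i += 1
        else:
            new_ids.append(new_list[j])      # present only in the new list
            j += 1
    erased_ids.extend(old_list[i:])          # leftovers exist on one side only
    new_ids.extend(new_list[j:])
    return new_ids, erased_ids

l1 = [1, 2, 3, 4, 5]
l2 = [0, 2, 3, 4, 6, 7]
print(diff_sorted(l1, l2))  # ([0, 6, 7], [1, 5])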