compare two lists to get non matching elements

Question

I checked this question about comparing lists, but only one answer is relevant to what I am trying to do. I have two lists with some similar elements, and I want to get the non-matching elements.

len(h) = 1973182  #h[0] = 'B00006J8F4F2', y[0] = 'B0075Y2X2GO6'
len(y) = 656890

I am doing

new_list = [i for i in h if i not in y]

However, this takes about 13 minutes. Is there a faster way of doing this?

Regarding the "duplicate" question, Finding elements not in a list: I use the same code as shown there. What I am looking for is a faster way of doing it.

Possible duplicate of [Finding elements not in a list](https://stackoverflow.com/questions/2104305/finding-elements-not-in-a-list) — Sayse, May 31 '19 at 14:43
Just to clarify - by "non-matching elements" you mean things in the first list that aren't in the second? Or thing that aren't in either? — doctorlove, May 31 '19 at 14:48
@doctorlove things in the first list that aren't in the second — programmerwiz32, May 31 '19 at 14:50
Okay my approach should work fine in that case @programmerwiz32 — yatu, May 31 '19 at 14:55
Updated the answer @programmerwiz32 here `sets` with `sorted` is performing up to 200 times faster — yatu, May 31 '19 at 15:18

yatu · Accepted Answer · 2019-05-31T15:15:14.707

2

You can use sets to more efficiently find the difference between both lists. If you need to keep the order in the original list you can use sorted with a key.

We want to sort the elements in the set according to their appearance in the original list, so one way is to build a lookup dictionary that maps each element to its index. We can use enumerate for that. Then the key function only needs to look each element up in the dictionary:

# Map each element of h to its position in h
d = {j: i for i, j in enumerate(h)}
# Set difference, then restore the original order of h
new_list = sorted(set(h) - set(y), key=lambda x: d[x])

Let's try with a simple example:

y = range(5)
h = range(7)
d = {j:i for i,j in enumerate(h)}
sorted(list((set(h) - set(y))), key = lambda x: d[x])
# [5, 6]

Timings -

import random
y = random.sample(range(1, 10001), 10000)
h = random.sample(range(1, 20001), 10000)

%timeit [i for i in h if i not in y]
# 1.28 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def using_sets(a,b):
    d = {j:i for i,j in enumerate(a)}
    # Return the result so the function is usable outside the benchmark
    return sorted(list((set(a) - set(b))), key = lambda x: d[x])

%timeit using_sets(h,y)
# 6.16 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So there's a clear improvement: on this sample the proposed approach is about 200 times faster (1.28 s / 6.16 ms ≈ 207).
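For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard-library `timeit` module. The smaller list sizes, the seed and the helper name `using_list` are my own choices so the script finishes quickly; absolute numbers will differ by machine.

```python
import random
import timeit

random.seed(0)  # arbitrary seed, only so the run is repeatable
y = random.sample(range(1, 2001), 2000)
h = random.sample(range(1, 4001), 2000)

def using_list(a, b):
    # Original approach: every `not in b` scans the list b
    return [i for i in a if i not in b]

def using_sets(a, b):
    # Set difference, then restore the order of a via an index lookup
    d = {j: i for i, j in enumerate(a)}
    return sorted(set(a) - set(b), key=d.get)

# Both approaches agree when a has no duplicates
assert using_list(h, y) == using_sets(h, y)

t_list = timeit.timeit(lambda: using_list(h, y), number=3)
t_sets = timeit.timeit(lambda: using_sets(h, y), number=3)
print(f"list: {t_list:.4f}s  sets: {t_sets:.4f}s")
```

The equality check matters as much as the timing: it only holds because `random.sample` produces unique elements, which matches what the OP said about their data.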

edited May 31 '19 at 15:15

answered May 31 '19 at 14:41

yatu


set(h)-set(y) will give element of h which is not in h, maybe OP requires this for both lists – sahasrara62 May 31 '19 at 14:46
This will also remove any duplicates that exist within `h` i.e `h = [2, 2]` – Sayse May 31 '19 at 14:47
The element in h which is not in h? This gives the set difference. Which is a more efficient way of doing `...if i not in y` – yatu May 31 '19 at 14:47
Yes that is a good point @Sayse it will if there are duplicates. Lets ask – yatu May 31 '19 at 14:48
Well, apparently the elements in the list are unique @sayse, thanks for pointing it out though – yatu May 31 '19 at 14:52
Your first link points to the deprecated `sets` module. The documentation for the newer, built-in `set` type is [here](https://docs.python.org/2/library/stdtypes.html#set). – user200783 May 31 '19 at 15:10
does your update indicate that my initial method is faster ? – programmerwiz32 May 31 '19 at 15:14
No @programmerwiz32 mine is 200 times faster. So its `1.28s/6.16ms = 207` – yatu May 31 '19 at 15:19
@yatu I got just a little faster than yours: 6.16/5.44=1.13 – sahasrara62 May 31 '19 at 15:25
Nice @prashantrana :) Nice soln using defaultdict – yatu May 31 '19 at 15:26
Getting 1.04 on my machine. So it's safe to say that performance-wise they're about the same @prashantrana – yatu May 31 '19 at 15:35
@yatu yes, performance-wise the same, maybe machine dependent, and I am using python3.7 – sahasrara62 May 31 '19 at 15:36
@yatu just check out a slightly improved solution: 6.16/2.75 = 2.24 – sahasrara62 May 31 '19 at 15:50

doctorlove · Answer 2 · 2019-05-31T15:07:00.840

The answer you linked to suggests using sets, because they use hashes to look things up quickly. With lists and in, like

new_list = [i for i in h if i not in y]

the whole of list y needs checking each time for each i in h.
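A minimal sketch of one way around that, assuming the elements are hashable (the strings in the question are): build a set from y once and keep the same comprehension, so the order and any duplicates in h survive. The name `y_set` and the sample data are only illustrative.

```python
# Illustrative data; the real lists in the question are much larger
h = ["a", "b", "b", "c", "d"]
y = ["c", "e"]

# Build the set once: each `in` test is then O(1) on average
# instead of a scan over the whole of y
y_set = set(y)

# Same comprehension as before, so order and duplicates of h are kept
new_list = [i for i in h if i not in y_set]
print(new_list)  # ['a', 'b', 'b', 'd']
```

Only the lookup structure changes here; the result is the same list the original comprehension would produce.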

You could use sets, but as has been pointed out, you need to be careful about duplicates getting lost.

You could use a Counter:

from collections import Counter

Then, with two lists, say

l1 = [1,1,2,3,4]
l2 = [3,3,4,5,6]

for example's sake, each can be fed into a Counter:

>>> Counter(l1)
Counter({1: 2, 2: 1, 3: 1, 4: 1})
>>> Counter(l2)
Counter({3: 2, 4: 1, 5: 1, 6: 1})

This just walks each list once. Subtracting them gives what's in the first but not the second:

>>> Counter(l1)-Counter(l2)
Counter({1: 2, 2: 1})

The elements() method then gives you what you want:

>>> diff = Counter(l1)-Counter(l2)
>>> list(diff.elements())
[1, 1, 2]
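One thing worth checking before choosing this: Counter subtraction is a multiset difference, so it cancels occurrences one for one, which is not the same as the `not in` filter from the question. A small sketch of the difference, with made-up lists and the illustrative names `by_count` and `by_membership`; the output order shown assumes Python 3.7+, where Counter keeps insertion order.

```python
from collections import Counter

l1 = [3, 3, 1]
l2 = [3]

# Multiset difference: the single 3 in l2 cancels only one 3 in l1
by_count = list((Counter(l1) - Counter(l2)).elements())

# Membership filter: every 3 is dropped because 3 occurs in l2 at all
l2_set = set(l2)
by_membership = [i for i in l1 if i not in l2_set]

print(by_count)       # [3, 1]
print(by_membership)  # [1]
```

If the lists have no duplicates, as the OP says, the two give the same elements.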

sahasrara62 · Answer 3 · 2019-05-31T15:49:44.907

0

A plain loop that keeps the order and preserves duplicates in list1:

def function(list1, list2):
    # Hash every element of list2 once, so each lookup is O(1) on average
    dic2 = {}
    for i in list2:
        dic2[i] = 1

    # Walk list1 in order and keep whatever is not a key of dic2
    result = []
    for i in list1:
        if i not in dic2:
            result.append(i)
    return result



list1=[1,2,2,3]
list2=[3,4,5]

solution = function(list1,list2)
print(solution)

Output:

[1, 2, 2]

Using @yatu's h and y lists, here is the timing result:

%timeit function(h,y)
2.75 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited May 31 '19 at 15:49

answered May 31 '19 at 15:02

sahasrara62


how is this faster than `new_list = sorted((set(h) - set(y)), key = h.index)` ? – programmerwiz32 May 31 '19 at 15:06
@programmerwiz32 because I don't sort, which would take O(n log n) time, and I don't do extra work to preserve the index. Here I just hash the values of list2 in a dictionary, so each lookup takes O(1) time and going through list1 takes O(n) time, so the complexity becomes `O(n)` – sahasrara62 May 31 '19 at 15:14
@programmerwiz32 see the current solution, check the speed, and if you find it good then accept and upvote – sahasrara62 May 31 '19 at 15:56

Alain T. · Answer 4 · 2019-05-31T18:21:12.350

0

You can use the Counter class from collections:

list1 = [1,1,2,3,4]
list2 = [3,3,4,5,6]

from collections import Counter
result = list((Counter(list1)-Counter(list2)).elements())

# [1, 1, 2]

Or, if you want mutual exclusion (whatever is left unmatched on either side):

count1 = Counter(list1)
count2 = Counter(list2)
r = list((count1-count2+(count2-count1)).elements()) 

# [1, 1, 2, 3, 5, 6]
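Note that the 3 appears in that result because list2 has one more 3 than list1. If you would rather compare by membership (an element is excluded as soon as it occurs in the other list at all), a sketch along these lines might fit better; `set1`, `set2`, `only_in_1` and `only_in_2` are names I made up for the example.

```python
list1 = [1, 1, 2, 3, 4]
list2 = [3, 3, 4, 5, 6]

set1, set2 = set(list1), set(list2)

# Elements that occur in exactly one of the lists, by membership
# rather than by count; order and duplicates of each list are kept
only_in_1 = [i for i in list1 if i not in set2]
only_in_2 = [i for i in list2 if i not in set1]

print(only_in_1 + only_in_2)  # [1, 1, 2, 5, 6]
```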

edited May 31 '19 at 18:21

answered May 31 '19 at 18:08

Alain T.

