Compute list difference

Question

In Python, what is the best way to compute the difference between two lists?

example

A = [1,2,3,4]
B = [2,5]

A - B = [1,3,4]
B - A = [5]

Does this answer your question? [Get difference between two lists](https://stackoverflow.com/questions/3462143/get-difference-between-two-lists) — mkrieger1, Apr 14 '22 at 12:02

score 458 · Accepted Answer · edited Jan 02 '21 at 10:35

458

If the order does not matter, you can simply calculate the set difference:

>>> set([1,2,3,4]) - set([2,5])
set([1, 4, 3])
>>> set([2,5]) - set([1,2,3,4])
set([5])

edited Jan 02 '21 at 10:35

Mohammed H

6,880
16
81
127

answered Jun 26 '11 at 20:16

phihag

278,196
72
453
469

14

This is by far the best solution. Test case on lists with ~6000 strings each showed that this method was almost 100x faster than list comprehensions. – perrygeo Feb 01 '14 at 17:01
18

Depends on application: if order or duplication preservation is important Roman Bodnarchuk may have a better approach. For speed and pure set-like behavior this one seems better. – Bryan P Feb 13 '15 at 23:01
13

If you have multiple equal elements in list this solution won't work. – karantan Sep 15 '16 at 13:02
Far more better than list comprehension. – Dawei Mar 13 '18 at 09:09
9

This solution seems so obvious but it's incorrect. I'm sorry. Of course we mean that a list can have repeated equal elements. Otherwise we ask about the difference between sets, not about list difference. – sergzach Jun 01 '18 at 10:20
1

You can also use set literals in this case. `{1, 2, 3, 4} - {2, 5}` is `{1, 3, 4}` ` – desertSniper87 Dec 26 '18 at 10:12

Roman Bodnarchuk · Answer 2 · 2016-04-29T11:54:31.740

220

Use set if you don't care about items order or repetition. Use list comprehensions if you do:

>>> def diff(first, second):
        second = set(second)
        return [item for item in first if item not in second]

>>> diff(A, B)
[1, 3, 4]
>>> diff(B, A)
[5]
>>>

edited Apr 29 '16 at 11:54

answered Jun 26 '11 at 20:23

Roman Bodnarchuk

29,461
12
59
75

35

Consider using `set(b)` to ensure the algorithm is O(nlogn) instead of Theta(n^2) – Neil G Jun 27 '11 at 08:47
10

@Pencilcheck - not if you care about ordering or duplication in A. Applying `set` to B is harmless, but applying it to `A` and using the result instead of the original `A` is not. – Mark Reed Aug 31 '14 at 19:09
1

@NeilG Do you consider time consumed to build the set? In my case (both lists have about 10M strings) time to build two sets and subtract them is considerably larger than building one set and iterating over the list. – dimril Jan 19 '18 at 08:19
@dimril if that's what you want to do maybe you should implement something more sophisticated. You could for example sort both lists O(n log n + m log m) and then iterate over the second list but use binary search to find the items in the first list. It would come out to O(n log n + m log m + m log n) operations (instead of O(n*m) operations), which doesn't seem too bad. Just make sure to check for neighbors to also eliminate duplicates in your binary search implementations. There might even be a package that already implements this, but I didn't check. – jaaq Feb 25 '19 at 09:49
Sorry but this solution does not preserve duplication in A since second is never changed. When an element of A matches with an element of B, this element has to be removed in B so that when the same element occurs again in A, it has to be kept if this element appears only once in B. I put a version of diff function in the thread that takes into account the duplication in A. – breakthewall Apr 28 '21 at 14:04

score 73 · Answer 3 · answered Jun 26 '11 at 20:19

73

You can do a

list(set(A)-set(B))

and

list(set(B)-set(A))

answered Jun 26 '11 at 20:19

Senthil Kumaran

54,681
14
94
131

10

But if A = [1,1,1] and B = [0] then this returns [1] – Mark Bell Feb 06 '17 at 19:14
1

@Mark Bell: Thats because a set is a distinct list. (removes duplicates) – cloudy May 01 '17 at 14:44
5

@cloudy Then this doesn't answer the question. – samm82 Dec 23 '19 at 05:27
@samm82 if A=[1,1,1] than set(A) is [1] because set is a distinct list and removes duplicates. Thats why, if A = [1,1,1] and B = [0] it returns [1]. – cloudy May 25 '20 at 11:57

score 36 · Answer 4 · answered Jun 27 '11 at 06:07

36

One liner:

diff = lambda l1,l2: [x for x in l1 if x not in l2]
diff(A,B)
diff(B,A)

Or:

diff = lambda l1,l2: filter(lambda x: x not in l2, l1)
diff(A,B)
diff(B,A)

answered Jun 27 '11 at 06:07

Artsiom Rudzenka

27,895
4
34
52

Kevin · Answer 5 · 2020-08-16T17:17:30.037

19

The above examples trivialized the problem of calculating differences. Assuming sorting or de-duplication definitely make it easier to compute the difference, but if your comparison cannot afford those assumptions then you'll need a non-trivial implementation of a diff algorithm. See difflib in the python standard library.

#! /usr/bin/python2
from difflib import SequenceMatcher

A = [1,2,3,4]
B = [2,5]

squeeze=SequenceMatcher( None, A, B )

print "A - B = [%s]"%( reduce( lambda p,q: p+q,
                               map( lambda t: squeeze.a[t[1]:t[2]],
                                    filter(lambda x:x[0]!='equal',
                                           squeeze.get_opcodes() ) ) ) )

Or Python3...

#! /usr/bin/python3
from difflib import SequenceMatcher
from functools import reduce

A = [1,2,3,4]
B = [2,5]

squeeze=SequenceMatcher( None, A, B )

print( "A - B = [%s]"%( reduce( lambda p,q: p+q,
                               map( lambda t: squeeze.a[t[1]:t[2]],
                                    filter(lambda x:x[0]!='equal',
                                           squeeze.get_opcodes() ) ) ) ) )

Output:

A - B = [[1, 3, 4]]

edited Aug 16 '20 at 17:17

answered Mar 26 '13 at 19:42

Kevin

431
4
11

1

you get +1 for difflib, which I hadn't seen before. nevertheless, I don't agree that the above answers trivialize the problem *as stated*. – rbp Jun 25 '13 at 21:12
Thanks for using difflib - I was looking for a solution using the standard library. However, this is not working in Python 3, as `print` has changed from a command to a function, and `reduce`, `filter` and `map` have been declared unpythonic. (And I think [Guido may be right](https://www.artima.com/weblogs/viewpost.jsp?thread=98196) - I don't understand what `reduce` does, either.) – Post169 Jun 20 '18 at 16:05
Not a big shift to make it work for py3. I've read the debate over filter, map, reduce and agree with the choice to push reduce and and alternate impl of filter into functools. The mixed functional, OO and procedural nature of python has always been, IMO, one of its strengths. – Kevin Jun 20 '18 at 17:43

score 15 · Answer 6 · answered Apr 14 '14 at 14:16

Python 2.7.3 (default, Feb 27 2014, 19:58:35) - IPython 1.1.0 - timeit: (github gist)

def diff(a, b):
  b = set(b)
  return [aa for aa in a if aa not in b]

def set_diff(a, b):
  return list(set(a) - set(b))

diff_lamb_hension = lambda l1,l2: [x for x in l1 if x not in l2]

diff_lamb_filter = lambda l1,l2: filter(lambda x: x not in l2, l1)

from difflib import SequenceMatcher
def squeezer(a, b):
  squeeze = SequenceMatcher(None, a, b)
  return reduce(lambda p,q: p+q, map(
    lambda t: squeeze.a[t[1]:t[2]],
      filter(lambda x:x[0]!='equal',
        squeeze.get_opcodes())))

Results:

# Small
a = range(10)
b = range(10/2)

timeit[diff(a, b)]
100000 loops, best of 3: 1.97 µs per loop

timeit[set_diff(a, b)]
100000 loops, best of 3: 2.71 µs per loop

timeit[diff_lamb_hension(a, b)]
100000 loops, best of 3: 2.1 µs per loop

timeit[diff_lamb_filter(a, b)]
100000 loops, best of 3: 3.58 µs per loop

timeit[squeezer(a, b)]
10000 loops, best of 3: 36 µs per loop

# Medium
a = range(10**4)
b = range(10**4/2)

timeit[diff(a, b)]
1000 loops, best of 3: 1.17 ms per loop

timeit[set_diff(a, b)]
1000 loops, best of 3: 1.27 ms per loop

timeit[diff_lamb_hension(a, b)]
1 loops, best of 3: 736 ms per loop

timeit[diff_lamb_filter(a, b)]
1 loops, best of 3: 732 ms per loop

timeit[squeezer(a, b)]
100 loops, best of 3: 12.8 ms per loop

# Big
a = xrange(10**7)
b = xrange(10**7/2)

timeit[diff(a, b)]
1 loops, best of 3: 1.74 s per loop

timeit[set_diff(a, b)]
1 loops, best of 3: 2.57 s per loop

timeit[diff_lamb_filter(a, b)]
# too long to wait for

timeit[diff_lamb_filter(a, b)]
# too long to wait for

timeit[diff_lamb_filter(a, b)]
# TypeError: sequence index must be integer, not 'slice'

@roman-bodnarchuk list comprehensions function def diff(a, b) seems to be faster.

score 9 · Answer 7 · answered Feb 25 '15 at 02:41

9

A = [1,2,3,4]
B = [2,5]

#A - B
x = list(set(A) - set(B))
#B - A 
y = list(set(B) - set(A))

print x
print y

answered Feb 25 '15 at 02:41

Saksham Varma

2,122
13
15

score 8 · Answer 8 · answered Jun 26 '11 at 20:15

8

You would want to use a set instead of a list.

answered Jun 26 '11 at 20:15

PrettyPrincessKitty FS

6,117
5
36
51

Seperman · Answer 9 · 2015-02-25T01:31:25.817

In case you want the difference recursively going deep into items of your list, I have written a package for python: https://github.com/erasmose/deepdiff

Installation

Install from PyPi:

pip install deepdiff

If you are Python3 you need to also install:

pip install future six

Example usage

>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function

Same object returns empty

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {}

Type of an item has changed

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {'type_changes': ["root[2]: 2=<type 'int'> vs. 2=<type 'str'>"]}

Value of an item has changed

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {'values_changed': ['root[2]: 2 ====>> 4']}

Item added and/or removed

>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes)
    {'dic_item_added': ['root[5, 6]'],
     'dic_item_removed': ['root[4]'],
     'values_changed': ['root[2]: 2 ====>> 4']}

String difference

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'values_changed': [ 'root[2]: 2 ====>> 4',
                          "root[4]['b']:\n--- \n+++ \n@@ -1 +1 @@\n-world\n+world!"]}
>>>
>>> print (ddiff.changes['values_changed'][1])
    root[4]['b']:
    --- 
    +++ 
    @@ -1 +1 @@
    -world
    +world!

String difference 2

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!\nGoodbye!\n1\n2\nEnd"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n1\n2\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'values_changed': [ "root[4]['b']:\n--- \n+++ \n@@ -1,5 +1,4 @@\n-world!\n-Goodbye!\n+world\n 1\n 2\n End"]}
>>>
>>> print (ddiff.changes['values_changed'][0])
    root[4]['b']:
    --- 
    +++ 
    @@ -1,5 +1,4 @@
    -world!
    -Goodbye!
    +world
     1
     2
     End

Type change

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n\n\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'type_changes': [ "root[4]['b']: [1, 2, 3]=<type 'list'> vs. world\n\n\nEnd=<type 'str'>"]}

List difference

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'list_removed': ["root[4]['b']: [3]"]}

List difference 2: Note that it DOES NOT take order into account

>>> # Note that it DOES NOT take order into account
... t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { }

List that contains dictionary:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'dic_item_removed': ["root[4]['b'][2][2]"],
      'values_changed': ["root[4]['b'][2][1]: 1 ====>> 3"]}

score 6 · Answer 10 · answered Nov 30 '17 at 12:53

6

most simple way,

use set().difference(set())

list_a = [1,2,3]
list_b = [2,3]
print set(list_a).difference(set(list_b))

answer is set([1])

answered Nov 30 '17 at 12:53

Mohideen bin Mohammed

18,813
10
112
118

score 2 · Answer 11 · answered Apr 19 '16 at 18:22

In case of a list of dictionaries, the full list comprehension solution works while the set solution raises

TypeError: unhashable type: 'dict'

Test Case

def diff(a, b):
    return [aa for aa in a if aa not in b]

d1 = {"a":1, "b":1}
d2 = {"a":2, "b":2}
d3 = {"a":3, "b":3}

>>> diff([d1, d2, d3], [d2, d3])
[{'a': 1, 'b': 1}]
>>> diff([d1, d2, d3], [d1])
[{'a': 2, 'b': 2}, {'a': 3, 'b': 3}]

Lee James · Answer 12 · 2021-06-23T02:53:48.590

There are 3 options for doing this, two of which are acceptable and one of which should not be done.

The 3 options, at a high level, are:

Subtract two sets (best sometimes)
Check if each list item exists in a set (best most of time)
Check if each list item exists in a list (do not do)

Option 3) should never be chosen over option 2). Depending on the needs of your application, you may prefer option 1) or 2), while 2) is likely the preferred approach in most use cases. 2) is very similar to the performance of 1) since both have O(m + n) time complexity. By contrast, 2) has marginal benefits in space complexity over 1) and maintains both the order of the original list and any duplications in the original list.

If you want to remove duplications and do not care about the order, then 1) is likely the best fit for you.

import time

def fun1(l1, l2):
    # Order and duplications in l1 are both lost, O(m) + O(n)
    return set(l1) - set(l2)

def fun2(l1, l2):
    # Order and duplications in l1 are both preserved, O(m) + O(n)
    l2_set = set(l2)
    return [item for item in l1 if item not in l2_set]

def fun3(l1, l2):
    # Order and duplications in l1 are both preserved, O(m * n)
    # Don't do
    return [item for item in l1 if item not in l2]

A = list(range(7500))
B = list(range(5000, 10000))

loops = 100

start = time.time()
for _ in range(loops):
    fun1(A, B)
print(f"fun1 time: {time.time() - start}")

start = time.time()
for _ in range(loops):
    fun2(A, B)
print(f"fun2 time: {time.time() - start}")

start = time.time()
for _ in range(loops):
    fun3(A, B)
print(f"fun3 time: {time.time() - start}")

fun1 time: 0.03749704360961914
fun2 time: 0.04109621047973633
fun3 time: 32.55076885223389

score 1 · Answer 13 · answered Jan 14 '19 at 09:15

1

Simple code that gives you the difference with multiple items if you want that:

a=[1,2,3,3,4]
b=[2,4]
tmp = copy.deepcopy(a)
for k in b:
    if k in tmp:
        tmp.remove(k)
print(tmp)

answered Jan 14 '19 at 09:15

A.M.

11
1

score 1 · Answer 14 · answered Oct 31 '22 at 10:35

If your order does not matter and both sets can be hashed, you can use a Symmetric Difference across the sets.

This will return the values that appear in either set A or set B, but not both.

For example, the question shows the return of either difference performed on list A and list B.

If we were to (cast both lists to sets and) perform a symmetric difference instead, we would get a merged result of the two in a single operation.

A = [1,2,3,4]
B = [2,5]
print(set(A) ^ set(B)

# {1, 3, 4, 5}

Adding this answer as I have not seen symmetric difference provided in the existing answers yet

score 0 · Answer 15 · answered Aug 08 '20 at 19:53

Adding an answer to take care of the case where we want a strict difference with repetitions, i.e., there are repetitions in the first list that we want to keep in the result. e.g. to get,

[1, 1, 1, 2] - [1, 1] --> [1, 2]

We could use an additional counter to have an elegant difference function.

from collections import Counter

def diff(first, second):
    secondCntr = Counter(second)
    second = set(second)
    res = []
    for i in first:
        if i not in second:
            res.append(i)
        elif i in secondCntr:
            if secondCntr[i] > 0:
                secondCntr[i] -= 1
            else:
                res.append(i)        
    return res

score 0 · Answer 16 · answered Apr 28 '21 at 14:09

I don't see a solution in this thread that preserves duplication in A. When an element of A matches with an element of B, this element has to be removed in B so that when the same element occurs again in A, it has to appear in the difference if this element appears only once in B.

def diff(first, second):
   l2 = list(second)
   l3 = []
   for el in first:
      if el in l2:
         l2.remove(el)
      else:
         l3 += [el]
   return l3

l1 = [1, 2, 1, 3, 4]
l2 = [1, 2, 3, 3]
diff(l1, l2)
>>> [1, 4]

DerKnorr · Answer 17 · 2015-03-13T14:18:53.553

When having a look at TimeComplexity of In-operator, in worst case it works with O(n). Even for Sets.

So when comparing two arrays we'll have a TimeComplexity of O(n) in best case and O(n^2) in worst case.

An alternative (but unfortunately more complex) solution, which works with O(n) in best and worst case is this one:

# Compares the difference of list a and b
# uses a callback function to compare items
def diff(a, b, callback):
  a_missing_in_b = []
  ai = 0
  bi = 0

  a = sorted(a, callback)
  b = sorted(b, callback)

  while (ai < len(a)) and (bi < len(b)):

    cmp = callback(a[ai], b[bi])
    if cmp < 0:
      a_missing_in_b.append(a[ai])
      ai += 1
    elif cmp > 0:
      # Item b is missing in a
      bi += 1
    else:
      # a and b intersecting on this item
      ai += 1
      bi += 1

  # if a and b are not of same length, we need to add the remaining items
  for ai in xrange(ai, len(a)):
    a_missing_in_b.append(a[ai])


  return a_missing_in_b

e.g.

>>> a=[1,2,3]
>>> b=[2,4,6]
>>> diff(a, b, cmp)
[1, 3]

You will never encounter the O(n) worst-case behavior in practice unless the hashing function for your items is broken. (In which case, that's what you should fix.) — GSnyder, Mar 05 '21 at 07:41

Compute list difference

17 Answers17

Installation

Example usage

Test Case

Linked

Related