263

In Python, what is the best way to compute the difference between two lists?

example

A = [1,2,3,4]
B = [2,5]

A - B = [1,3,4]
B - A = [5]
martineau
  • 119,623
  • 25
  • 170
  • 301
Mike
  • 58,961
  • 76
  • 175
  • 221
  • 1
    Does this answer your question? [Get difference between two lists](https://stackoverflow.com/questions/3462143/get-difference-between-two-lists) – mkrieger1 Apr 14 '22 at 12:02

17 Answers17

458

If the order does not matter, you can simply calculate the set difference:

>>> set([1,2,3,4]) - set([2,5])
set([1, 4, 3])
>>> set([2,5]) - set([1,2,3,4])
set([5])
Mohammed H
  • 6,880
  • 16
  • 81
  • 127
phihag
  • 278,196
  • 72
  • 453
  • 469
  • 14
    This is by far the best solution. Test case on lists with ~6000 strings each showed that this method was almost 100x faster than list comprehensions. – perrygeo Feb 01 '14 at 17:01
  • 18
    Depends on application: if order or duplication preservation is important Roman Bodnarchuk may have a better approach. For speed and pure set-like behavior this one seems better. – Bryan P Feb 13 '15 at 23:01
  • 13
    If you have multiple equal elements in list this solution won't work. – karantan Sep 15 '16 at 13:02
  • Far more better than list comprehension. – Dawei Mar 13 '18 at 09:09
  • 9
    This solution seems so obvious but it's incorrect. I'm sorry. Of course we mean that a list can have repeated equal elements. Otherwise we ask about the difference between sets, not about list difference. – sergzach Jun 01 '18 at 10:20
  • 1
    You can also use set literals in this case. `{1, 2, 3, 4} - {2, 5}` is `{1, 3, 4}` ` – desertSniper87 Dec 26 '18 at 10:12
220

Use set if you don't care about items order or repetition. Use list comprehensions if you do:

>>> def diff(first, second):
        second = set(second)
        return [item for item in first if item not in second]

>>> diff(A, B)
[1, 3, 4]
>>> diff(B, A)
[5]
>>> 
Roman Bodnarchuk
  • 29,461
  • 12
  • 59
  • 75
  • 35
    Consider using `set(b)` to ensure the algorithm is O(nlogn) instead of Theta(n^2) – Neil G Jun 27 '11 at 08:47
  • 10
    @Pencilcheck - not if you care about ordering or duplication in A. Applying `set` to B is harmless, but applying it to `A` and using the result instead of the original `A` is not. – Mark Reed Aug 31 '14 at 19:09
  • 1
    @NeilG Do you consider time consumed to build the set? In my case (both lists have about 10M strings) time to build two sets and subtract them is considerably larger than building one set and iterating over the list. – dimril Jan 19 '18 at 08:19
  • @dimril if that's what you want to do maybe you should implement something more sophisticated. You could for example sort both lists O(n log n + m log m) and then iterate over the second list but use binary search to find the items in the first list. It would come out to O(n log n + m log m + m log n) operations (instead of O(n*m) operations), which doesn't seem too bad. Just make sure to check for neighbors to also eliminate duplicates in your binary search implementations. There might even be a package that already implements this, but I didn't check. – jaaq Feb 25 '19 at 09:49
  • Sorry but this solution does not preserve duplication in A since second is never changed. When an element of A matches with an element of B, this element has to be removed in B so that when the same element occurs again in A, it has to be kept if this element appears only once in B. I put a version of diff function in the thread that takes into account the duplication in A. – breakthewall Apr 28 '21 at 14:04
73

You can do a

list(set(A)-set(B))

and

list(set(B)-set(A))
Senthil Kumaran
  • 54,681
  • 14
  • 94
  • 131
36

One liner:

diff = lambda l1,l2: [x for x in l1 if x not in l2]
diff(A,B)
diff(B,A)

Or:

diff = lambda l1,l2: filter(lambda x: x not in l2, l1)
diff(A,B)
diff(B,A)
Artsiom Rudzenka
  • 27,895
  • 4
  • 34
  • 52
19

The above examples trivialized the problem of calculating differences. Assuming sorting or de-duplication definitely make it easier to compute the difference, but if your comparison cannot afford those assumptions then you'll need a non-trivial implementation of a diff algorithm. See difflib in the python standard library.

#! /usr/bin/python2
from difflib import SequenceMatcher

A = [1,2,3,4]
B = [2,5]

squeeze=SequenceMatcher( None, A, B )

print "A - B = [%s]"%( reduce( lambda p,q: p+q,
                               map( lambda t: squeeze.a[t[1]:t[2]],
                                    filter(lambda x:x[0]!='equal',
                                           squeeze.get_opcodes() ) ) ) )

Or Python3...

#! /usr/bin/python3
from difflib import SequenceMatcher
from functools import reduce

A = [1,2,3,4]
B = [2,5]

squeeze=SequenceMatcher( None, A, B )

print( "A - B = [%s]"%( reduce( lambda p,q: p+q,
                               map( lambda t: squeeze.a[t[1]:t[2]],
                                    filter(lambda x:x[0]!='equal',
                                           squeeze.get_opcodes() ) ) ) ) )

Output:

A - B = [[1, 3, 4]]
Kevin
  • 431
  • 4
  • 11
  • 1
    you get +1 for difflib, which I hadn't seen before. nevertheless, I don't agree that the above answers trivialize the problem *as stated*. – rbp Jun 25 '13 at 21:12
  • Thanks for using difflib - I was looking for a solution using the standard library. However, this is not working in Python 3, as `print` has changed from a command to a function, and `reduce`, `filter` and `map` have been declared unpythonic. (And I think [Guido may be right](https://www.artima.com/weblogs/viewpost.jsp?thread=98196) - I don't understand what `reduce` does, either.) – Post169 Jun 20 '18 at 16:05
  • Not a big shift to make it work for py3. I've read the debate over filter, map, reduce and agree with the choice to push reduce and and alternate impl of filter into functools. The mixed functional, OO and procedural nature of python has always been, IMO, one of its strengths. – Kevin Jun 20 '18 at 17:43
15

Python 2.7.3 (default, Feb 27 2014, 19:58:35) - IPython 1.1.0 - timeit: (github gist)

def diff(a, b):
  b = set(b)
  return [aa for aa in a if aa not in b]

def set_diff(a, b):
  return list(set(a) - set(b))

diff_lamb_hension = lambda l1,l2: [x for x in l1 if x not in l2]

diff_lamb_filter = lambda l1,l2: filter(lambda x: x not in l2, l1)

from difflib import SequenceMatcher
def squeezer(a, b):
  squeeze = SequenceMatcher(None, a, b)
  return reduce(lambda p,q: p+q, map(
    lambda t: squeeze.a[t[1]:t[2]],
      filter(lambda x:x[0]!='equal',
        squeeze.get_opcodes())))

Results:

# Small
a = range(10)
b = range(10/2)

timeit[diff(a, b)]
100000 loops, best of 3: 1.97 µs per loop

timeit[set_diff(a, b)]
100000 loops, best of 3: 2.71 µs per loop

timeit[diff_lamb_hension(a, b)]
100000 loops, best of 3: 2.1 µs per loop

timeit[diff_lamb_filter(a, b)]
100000 loops, best of 3: 3.58 µs per loop

timeit[squeezer(a, b)]
10000 loops, best of 3: 36 µs per loop

# Medium
a = range(10**4)
b = range(10**4/2)

timeit[diff(a, b)]
1000 loops, best of 3: 1.17 ms per loop

timeit[set_diff(a, b)]
1000 loops, best of 3: 1.27 ms per loop

timeit[diff_lamb_hension(a, b)]
1 loops, best of 3: 736 ms per loop

timeit[diff_lamb_filter(a, b)]
1 loops, best of 3: 732 ms per loop

timeit[squeezer(a, b)]
100 loops, best of 3: 12.8 ms per loop

# Big
a = xrange(10**7)
b = xrange(10**7/2)

timeit[diff(a, b)]
1 loops, best of 3: 1.74 s per loop

timeit[set_diff(a, b)]
1 loops, best of 3: 2.57 s per loop

timeit[diff_lamb_filter(a, b)]
# too long to wait for

timeit[diff_lamb_filter(a, b)]
# too long to wait for

timeit[diff_lamb_filter(a, b)]
# TypeError: sequence index must be integer, not 'slice'

@roman-bodnarchuk list comprehensions function def diff(a, b) seems to be faster.

Moreno
  • 1,567
  • 1
  • 12
  • 15
9
A = [1,2,3,4]
B = [2,5]

#A - B
x = list(set(A) - set(B))
#B - A 
y = list(set(B) - set(A))

print x
print y 
Saksham Varma
  • 2,122
  • 13
  • 15
8

You would want to use a set instead of a list.

PrettyPrincessKitty FS
  • 6,117
  • 5
  • 36
  • 51
6

In case you want the difference recursively going deep into items of your list, I have written a package for python: https://github.com/erasmose/deepdiff

Installation

Install from PyPi:

pip install deepdiff

If you are Python3 you need to also install:

pip install future six

Example usage

>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function

Same object returns empty

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {}

Type of an item has changed

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {'type_changes': ["root[2]: 2=<type 'int'> vs. 2=<type 'str'>"]}

Value of an item has changed

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {'values_changed': ['root[2]: 2 ====>> 4']}

Item added and/or removed

>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes)
    {'dic_item_added': ['root[5, 6]'],
     'dic_item_removed': ['root[4]'],
     'values_changed': ['root[2]: 2 ====>> 4']}

String difference

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'values_changed': [ 'root[2]: 2 ====>> 4',
                          "root[4]['b']:\n--- \n+++ \n@@ -1 +1 @@\n-world\n+world!"]}
>>>
>>> print (ddiff.changes['values_changed'][1])
    root[4]['b']:
    --- 
    +++ 
    @@ -1 +1 @@
    -world
    +world!

String difference 2

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!\nGoodbye!\n1\n2\nEnd"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n1\n2\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'values_changed': [ "root[4]['b']:\n--- \n+++ \n@@ -1,5 +1,4 @@\n-world!\n-Goodbye!\n+world\n 1\n 2\n End"]}
>>>
>>> print (ddiff.changes['values_changed'][0])
    root[4]['b']:
    --- 
    +++ 
    @@ -1,5 +1,4 @@
    -world!
    -Goodbye!
    +world
     1
     2
     End

Type change

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n\n\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'type_changes': [ "root[4]['b']: [1, 2, 3]=<type 'list'> vs. world\n\n\nEnd=<type 'str'>"]}

List difference

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'list_removed': ["root[4]['b']: [3]"]}

List difference 2: Note that it DOES NOT take order into account

>>> # Note that it DOES NOT take order into account
... t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { }

List that contains dictionary:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'dic_item_removed': ["root[4]['b'][2][2]"],
      'values_changed': ["root[4]['b'][2][1]: 1 ====>> 3"]}
Seperman
  • 4,254
  • 1
  • 28
  • 27
6

most simple way,

use set().difference(set())

list_a = [1,2,3]
list_b = [2,3]
print set(list_a).difference(set(list_b))

answer is set([1])

Mohideen bin Mohammed
  • 18,813
  • 10
  • 112
  • 118
2

In case of a list of dictionaries, the full list comprehension solution works while the set solution raises

TypeError: unhashable type: 'dict'

Test Case

def diff(a, b):
    return [aa for aa in a if aa not in b]

d1 = {"a":1, "b":1}
d2 = {"a":2, "b":2}
d3 = {"a":3, "b":3}

>>> diff([d1, d2, d3], [d2, d3])
[{'a': 1, 'b': 1}]
>>> diff([d1, d2, d3], [d1])
[{'a': 2, 'b': 2}, {'a': 3, 'b': 3}]
joao
  • 4,067
  • 3
  • 33
  • 37
2

There are 3 options for doing this, two of which are acceptable and one of which should not be done.

The 3 options, at a high level, are:

  1. Subtract two sets (best sometimes)
  2. Check if each list item exists in a set (best most of time)
  3. Check if each list item exists in a list (do not do)

Option 3) should never be chosen over option 2). Depending on the needs of your application, you may prefer option 1) or 2), while 2) is likely the preferred approach in most use cases. 2) is very similar to the performance of 1) since both have O(m + n) time complexity. By contrast, 2) has marginal benefits in space complexity over 1) and maintains both the order of the original list and any duplications in the original list.

If you want to remove duplications and do not care about the order, then 1) is likely the best fit for you.

import time

def fun1(l1, l2):
    # Order and duplications in l1 are both lost, O(m) + O(n)
    return set(l1) - set(l2)

def fun2(l1, l2):
    # Order and duplications in l1 are both preserved, O(m) + O(n)
    l2_set = set(l2)
    return [item for item in l1 if item not in l2_set]

def fun3(l1, l2):
    # Order and duplications in l1 are both preserved, O(m * n)
    # Don't do
    return [item for item in l1 if item not in l2]

A = list(range(7500))
B = list(range(5000, 10000))

loops = 100

start = time.time()
for _ in range(loops):
    fun1(A, B)
print(f"fun1 time: {time.time() - start}")

start = time.time()
for _ in range(loops):
    fun2(A, B)
print(f"fun2 time: {time.time() - start}")

start = time.time()
for _ in range(loops):
    fun3(A, B)
print(f"fun3 time: {time.time() - start}")
fun1 time: 0.03749704360961914
fun2 time: 0.04109621047973633
fun3 time: 32.55076885223389
Lee James
  • 91
  • 1
  • 5
1

Simple code that gives you the difference with multiple items if you want that:

a=[1,2,3,3,4]
b=[2,4]
tmp = copy.deepcopy(a)
for k in b:
    if k in tmp:
        tmp.remove(k)
print(tmp)
A.M.
  • 11
  • 1
1

If your order does not matter and both sets can be hashed, you can use a Symmetric Difference across the sets.

This will return the values that appear in either set A or set B, but not both.

For example, the question shows the return of either difference performed on list A and list B.

If we were to (cast both lists to sets and) perform a symmetric difference instead, we would get a merged result of the two in a single operation.

A = [1,2,3,4]
B = [2,5]
print(set(A) ^ set(B)

# {1, 3, 4, 5}

Adding this answer as I have not seen symmetric difference provided in the existing answers yet

WebWanderer
  • 10,380
  • 3
  • 32
  • 51
0

Adding an answer to take care of the case where we want a strict difference with repetitions, i.e., there are repetitions in the first list that we want to keep in the result. e.g. to get,

[1, 1, 1, 2] - [1, 1] --> [1, 2]

We could use an additional counter to have an elegant difference function.

from collections import Counter

def diff(first, second):
    secondCntr = Counter(second)
    second = set(second)
    res = []
    for i in first:
        if i not in second:
            res.append(i)
        elif i in secondCntr:
            if secondCntr[i] > 0:
                secondCntr[i] -= 1
            else:
                res.append(i)        
    return res
Maaverik
  • 161
  • 3
0

I don't see a solution in this thread that preserves duplication in A. When an element of A matches with an element of B, this element has to be removed in B so that when the same element occurs again in A, it has to appear in the difference if this element appears only once in B.

def diff(first, second):
   l2 = list(second)
   l3 = []
   for el in first:
      if el in l2:
         l2.remove(el)
      else:
         l3 += [el]
   return l3

l1 = [1, 2, 1, 3, 4]
l2 = [1, 2, 3, 3]
diff(l1, l2)
>>> [1, 4]
breakthewall
  • 183
  • 6
-1

When having a look at TimeComplexity of In-operator, in worst case it works with O(n). Even for Sets.

So when comparing two arrays we'll have a TimeComplexity of O(n) in best case and O(n^2) in worst case.

An alternative (but unfortunately more complex) solution, which works with O(n) in best and worst case is this one:

# Compares the difference of list a and b
# uses a callback function to compare items
def diff(a, b, callback):
  a_missing_in_b = []
  ai = 0
  bi = 0

  a = sorted(a, callback)
  b = sorted(b, callback)

  while (ai < len(a)) and (bi < len(b)):

    cmp = callback(a[ai], b[bi])
    if cmp < 0:
      a_missing_in_b.append(a[ai])
      ai += 1
    elif cmp > 0:
      # Item b is missing in a
      bi += 1
    else:
      # a and b intersecting on this item
      ai += 1
      bi += 1

  # if a and b are not of same length, we need to add the remaining items
  for ai in xrange(ai, len(a)):
    a_missing_in_b.append(a[ai])


  return a_missing_in_b

e.g.

>>> a=[1,2,3]
>>> b=[2,4,6]
>>> diff(a, b, cmp)
[1, 3]
DerKnorr
  • 78
  • 5
  • You will never encounter the O(n) worst-case behavior in practice unless the hashing function for your items is broken. (In which case, that's what you should fix.) – GSnyder Mar 05 '21 at 07:41