593

Let's say I have two lists, l1 and l2. I want to perform l1 - l2, which returns all elements of l1 not in l2.

I can think of a naive loop approach to doing this, but that is going to be really inefficient. What is a pythonic and efficient way of doing this?

As an example, if I have l1 = [1,2,6,8] and l2 = [2,3,5,8], l1 - l2 should return [1,6]

APerson
  • 8,140
  • 8
  • 35
  • 49
fandom
  • 5,931
  • 2
  • 15
  • 3
  • 27
    Just a tip: [PEP8](https://www.python.org/dev/peps/pep-0008/#names-to-avoid) states that lowercase "L" should not be used because it looks too much like a 1. – spelchekr Jun 10 '15 at 19:43
  • 3
    I agree. I read this whole question and the answers wondering why people kept using eleven and twelve. It was only when I read @spelchekr 's comment that it made sense. – robline Aug 21 '19 at 17:15
  • 1
    Possible duplicate of [dropping rows from dataframe based on a "not in" condition](https://stackoverflow.com/questions/27965295/dropping-rows-from-dataframe-based-on-a-not-in-condition) – Jim G. Sep 11 '19 at 18:34
  • 3
    @JimG. Dataframe and list is not the same thing. – reducing activity Nov 06 '19 at 10:10
  • 1
    This question is not well defined. Lists allow duplicate items. Should [1, 1, 2, 3] - [1, 2] return [1, 3] or just [3]? The solutions below all seem to to assume that it should return [3], that all members which match an element of the second list should be removed from the first list. You sometimes need list operations to obey algebraic properties. If A - B = C, then C + B = A. That isn't the case here, since duplicate values are lost. – Dan J. Oct 14 '21 at 21:03
  • See also: [How to find list intersection?](/questions/3697432/how-to-find-list-intersection). (This question is for list *difference*, which is equivalent to removing the intersection from the first input.) – Karl Knechtel Aug 11 '22 at 07:39

13 Answers13

755

Python has a language feature called List Comprehensions that is perfectly suited to making this sort of thing extremely easy. The following statement does exactly what you want and stores the result in l3:

l3 = [x for x in l1 if x not in l2]

l3 will contain [1, 6].

NoDataDumpNoContribution
  • 10,591
  • 9
  • 64
  • 104
Donut
  • 110,061
  • 20
  • 134
  • 146
  • 20
    Very pythonic; I like it! How efficient is it? – fandom Nov 18 '10 at 02:51
  • 6
    I believe quite efficient, and it has the benefit of being extremely readable and clear as to what you're trying to accomplish. I came across a blog post you might find interesting relating to efficiency: http://blog.cdleary.com/2010/04/efficiency-of-list-comprehensions/ – Donut Nov 18 '10 at 02:55
  • 13
    @fandom: the list comprehension itself is quite efficient (although a generator comprehension might be more efficient by not duplicating elements in memory), but the `in` operator isn't that efficient on a list. `in` on a list is O(n), whereas `in` on a set is O(1). However, until you get to thousands of elements or more, you're unlikely to notice the difference. – Daniel Pryden Nov 18 '10 at 03:10
  • 2
    `l3 = [x for x in l1 if x not in set(l2)]` ? I am sure if `set(l2)` would be called more than once. – Danosaure Nov 18 '10 at 06:30
  • @Danosaure: yes, `set(l2)` would be called for each element of `l1`. In general, if you're going to be using `in` a lot, you probably shouldn't be using a list (or tuple) to begin with (though if you know it'll always be really small it wouldn't be terrible). – Laurence Gonsalves Nov 18 '10 at 07:19
  • for LARGE lists where one wants the `O(1)`ness of sets but doesn't want to call `set` again and again, one can use `[x for l2s in [set(l2)] for x in l1 if x not in l2s]` – gboffi Nov 20 '14 at 23:41
  • 6
    You could also just set `l2s = set(l2)` and then say `l3 = [x for x in l1 if x not in l2s]`. Slightly easier. – spelchekr Jun 10 '15 at 19:40
  • If you are able to use `set` at all, you should just use [Arkku's answer](http://stackoverflow.com/a/4211239/130427). It's both more efficient and readable. I couldn't use sets in my application because my items aren't hashable. – Chris Redford Jun 26 '15 at 20:52
  • FYI, in `l3 = [x for x in l1 if x not in set(l2)]` ... the `set(l2)` part is only run once, so this should be the most efficient, readable answer that doesn't remove duplicates or reorder the list. – Eric Lindauer Sep 23 '15 at 13:00
  • 1
    @EricLindauer: if you replace `x not in set(l2)` with `print(x)`, you'll see every element of `l1` printed to stdout, i.e. the `print` function will be called on every iteration. Same holds for `set(l2)`. – vaultah Aug 13 '17 at 13:28
  • this would behave weird for l1 = [1, 2, 2, 3, 3, 3] and l2 = [2, 3, 3]. It would return [1]. And we would probably expect [1, 2, 3] – Andrei Prãdan Jan 27 '21 at 20:43
  • Beware the list size! This solution becomes very expensive for large lists. It scales way worse than subtracting two sets. – joba2ca Feb 09 '23 at 17:21
239

One way is to use sets:

>>> set([1,2,6,8]) - set([2,3,5,8])
set([1, 6])

Note, however, that sets do not preserve the order of elements, and cause any duplicated elements to be removed. The elements also need to be hashable. If these restrictions are tolerable, this may often be the simplest and highest performance option.

Arkku
  • 41,011
  • 10
  • 62
  • 84
  • 75
    This will also remove duplicates from `l1`, which may be an undesired side effect. – kindall Nov 18 '10 at 04:43
  • 53
    ..and lose element order (if order is important). – Danosaure Nov 18 '10 at 06:27
  • 8
    I just want to add that I timed this vs. the accepted answer and it was more performant by a factor of about 3: `timeit.timeit('a = [1,2,3,4]; b = [1,3]; c = [i for i in a if a not in b]', number=100000) -> 0.12061533199999985` `timeit.timeit('a = {1,2,3,4}; b = {1,3}; c = a - b', number=100000) -> 0.04106225999998969`. So if performance is a significant factor, this answer may be more appropriate (and also if you don't care about duplicates or order) – wfgeo Jul 15 '19 at 08:32
  • Faster but not in existing order – hongkail Oct 17 '20 at 15:44
  • While it resolves my issue, question does state `returns all elements of l1 not in l2.` Hence, this does not resolve the OP issue. – MinneapolisCoder9 Aug 20 '22 at 13:40
176

Performance Comparisons

Comparing the performance of all the answers mentioned here on Python 3.9.1 and Python 2.7.16.

Python 3.9.1

Answers are mentioned in order of performance:

  1. Arkku's set difference using subtraction "-" operation - (91.3 nsec per loop)

    mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2"
    5000000 loops, best of 5: 91.3 nsec per loop
    
  2. Moinuddin Quadri's using set().difference()- (133 nsec per loop)

    mquadri$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)"
    2000000 loops, best of 5: 133 nsec per loop
    
  3. Moinuddin Quadri's list comprehension with set based lookup- (366 nsec per loop)

     mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]"
     1000000 loops, best of 5: 366 nsec per loop
    
  4. Donut's list comprehension on plain list - (489 nsec per loop)

     mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]"
     500000 loops, best of 5: 489 nsec per loop
    
  5. Daniel Pryden's generator expression with set based lookup and type-casting to list - (583 nsec per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup.

     mquadri$ mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)"
     500000 loops, best of 5: 583 nsec per loop
    
  6. Moinuddin Quadri's using filter() and explicitly type-casting to list (need to explicitly type-cast as in Python 3.x, it returns iterator) - (681 nsec per loop)

     mquadri$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filter(lambda x: x not in l2, l1))"
     500000 loops, best of 5: 681 nsec per loop
    
  7. Akshay Hazari's using combination of functools.reduce + filter -(3.36 usec per loop) : Explicitly type-casting to list as from Python 3.x it started returned returning iterator. Also we need to import functools to use reduce in Python 3.x

     mquadri$ python3 -m timeit "from functools import reduce; l1 = [1,2,6,8]; l2 = [2,3,5,8];" "list(reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2))"
     100000 loops, best of 5: 3.36 usec per loop
    

Python 2.7.16

Answers are mentioned in order of performance:

  1. Arkku's set difference using subtraction "-" operation - (0.0783 usec per loop)

    mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2"
    10000000 loops, best of 3: 0.0783 usec per loop
    
  2. Moinuddin Quadri's using set().difference()- (0.117 usec per loop)

    mquadri$ mquadri$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)"
    10000000 loops, best of 3: 0.117 usec per loop
    
  3. Moinuddin Quadri's list comprehension with set based lookup- (0.246 usec per loop)

     mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]"
     1000000 loops, best of 3: 0.246 usec per loop
    
  4. Donut's list comprehension on plain list - (0.372 usec per loop)

     mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]"
     1000000 loops, best of 3: 0.372 usec per loop
    
  5. Moinuddin Quadri's using filter() - (0.593 usec per loop)

     mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "filter(lambda x: x not in l2, l1)"
     1000000 loops, best of 3: 0.593 usec per loop
    
  6. Daniel Pryden's generator expression with set based lookup and type-casting to list - (0.964 per loop) : Explicitly type-casting to list to get the final object as list, as requested by OP. If generator expression is replaced with list comprehension, it'll become same as Moinuddin Quadri's list comprehension with set based lookup.

     mquadri$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)"
     1000000 loops, best of 3: 0.964 usec per loop
    
  7. Akshay Hazari's using combination of functools.reduce + filter -(2.78 usec per loop)

     mquadri$ python -m timeit "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2)"
     100000 loops, best of 3: 2.78 usec per loop
    
Moinuddin Quadri
  • 46,825
  • 13
  • 96
  • 126
  • 17
    This answer is a great service to humanity. I was using list comprehension and my operation failed to finish in 25 minutes; then I switched to set subtraction and it finished in 24 seconds. A miraculous improvement far beyond your timeit results. – Aaron Bramson Mar 04 '21 at 08:01
  • WHy number of loops is different for different approaches in Python 3.9? – Laurynas G Dec 28 '21 at 11:28
  • 1
    Yeah, the comprehensions have issues when both lists are big. E.g., try lists of 10000+. e.g., l1 = [x for x in range(10000); l2 = [x for x in range(100,10100)], optionally with a shuffle. The list comprehension versions are 500-1000x slower. A drawback of the set approach is that the first array must have unique elements. Also the answer appears to be mixing and matching nsec and usecs. – dlm Mar 14 '22 at 17:12
  • Thanks for the comparison. It could be interesting to mentioned that the difference of duration between list comprehension from Donut and set subtraction from Arkku is a factor *n*, *n* being the length of the list you subtract from. In the example chosen, it last 4 times longer because the second list has 4 elements, but if your second list contains 1000 elements, it will take 1000 times longer... In cases with long list, the list comprehension is just not usable. – Ken May 23 '23 at 13:42
  • Great answer. just note when using the `set` solutions, the order of the returned value is not guaranteed. – shayms8 Aug 14 '23 at 13:38
33

Expanding on Donut's answer and the other answers here, you can get even better results by using a generator comprehension instead of a list comprehension, and by using a set data structure (since the in operator is O(n) on a list but O(1) on a set).

So here's a function that would work for you:

def filter_list(full_list, excludes):
    s = set(excludes)
    return (x for x in full_list if x not in s)

The result will be an iterable that will lazily fetch the filtered list. If you need a real list object (e.g. if you need to do a len() on the result), then you can easily build a list like so:

filtered_list = list(filter_list(full_list, excludes))
Daniel Pryden
  • 59,486
  • 16
  • 97
  • 135
  • Thanks for this answer, it solves O(N) complexity problem of *in* test and keeps the order of the full_list. I would suggest to keep list comprehension format for lisibility: excludes_set = set(excludes_list) filtered_list = [x for x in full_list if x not in excludes_set] – Ken May 23 '23 at 13:49
32

Use the Python set type. That would be the most Pythonic. :)

Also, since it's native, it should be the most optimized method too.

See:

http://docs.python.org/library/stdtypes.html#set

http://docs.python.org/library/sets.htm (for older python)

# Using Python 2.7 set literal format.
# Otherwise, use: l1 = set([1,2,6,8])
#
l1 = {1,2,6,8}
l2 = {2,3,5,8}
l3 = l1 - l2
skrrgwasme
  • 9,358
  • 11
  • 54
  • 84
nonot1
  • 2,788
  • 4
  • 25
  • 41
  • 7
    When using sets it should be noted that the output of is ordered, i.e. {1,3,2} becomes {1,2,3} and {"A","C","B"} becomes {"A","B","C"} and you might not want to have that. – Pablo Reyes Mar 08 '17 at 00:19
  • 2
    this method will not work if list `l1` includes repeated elements. – jdhao Nov 08 '18 at 02:36
13

use Set Comprehensions {x for x in l2} or set(l2) to get set, then use List Comprehensions to get list

l2set = set(l2)
l3 = [x for x in l1 if x not in l2set]

benchmark test code:

import time

l1 = list(range(1000*10 * 3))
l2 = list(range(1000*10 * 2))

l2set = {x for x in l2}

tic = time.time()
l3 = [x for x in l1 if x not in l2set]
toc = time.time()
diffset = toc-tic
print(diffset)

tic = time.time()
l3 = [x for x in l1 if x not in l2]
toc = time.time()
difflist = toc-tic
print(difflist)

print("speedup %fx"%(difflist/diffset))

benchmark test result:

0.0015058517456054688
3.968189239501953
speedup 2635.179227x    
andrewchan2022
  • 4,953
  • 45
  • 48
9

Alternate Solution :

reduce(lambda x,y : filter(lambda z: z!=y,x) ,[2,3,5,8],[1,2,6,8])
Akshay Hazari
  • 3,186
  • 4
  • 48
  • 84
  • 3
    Is there any advantage to using this method? It looks like it's more complex and harder to read without much benefit. – skrrgwasme Nov 07 '15 at 01:13
  • That might seem complex . Reduce is very flexible and can be used for a lot of purposes. It is known as fold . reduce is actually foldl . Suppose you want to add more complex stuff in it then it will be possible in this function but list comprehension which is the selected best answer will only get you an output of the same type i.e list and probably of the same length while with folds you could change the output type as well . https://en.wikipedia.org/wiki/Fold_%28higher-order_function%29 . This solution is n*m or less complexity. Others may or may not be better though. – Akshay Hazari Nov 09 '15 at 04:09
  • 1
    reduce (function , list , initial accumulator (which can be of any type)) – Akshay Hazari Nov 09 '15 at 04:11
7

Using set.difference():

You can use set.difference() to get new set with elements in the set that are not in the others. i.e. set(A).difference(B) will return set with items present in A, but not in B. For example:

>>> set([1,2,6,8]).difference([2,3,5,8])
{1, 6}

It is a functional approach to get set difference mentioned in Arkku's answer (which uses arithmetic subtraction - operator for set difference).

Since sets are unordered, you'll loose the ordering of elements from initial list. (continue reading next section if you want to maintain the orderig of elements)

Using List Comprehension with set based lookup

If you want to maintain the ordering from initial list, then Donut's list comprehension based answer will do the trick. However, you can get better performance from the accepted answer by using set internally for checking whether element is present in other list. For example:

l1, l2 = [1,2,6,8], [2,3,5,8]
s2 = set(l2)  # Type-cast `l2` to `set`

l3 = [x for x in l1 if x not in s2]
                             #   ^ Doing membership checking on `set` s2

If you are interested in knowing why membership checking is faster is set when compared to list, please read this: What makes sets faster than lists?


Using filter() and lambda expression

Here's another alternative using filter() with the lambda expression. Adding it here just for reference, but it is not performance efficient:

>>> l1 = [1,2,6,8]
>>> l2 = set([2,3,5,8])

#     v  `filter` returns the a iterator object. Here I'm type-casting 
#     v  it to `list` in order to display the resultant value
>>> list(filter(lambda x: x not in l2, l1))
[1, 6]
Moinuddin Quadri
  • 46,825
  • 13
  • 96
  • 126
5

Using filterfalse without lambda-expression

When using functions like filter or filterfalse and similar from itertools you can usually save performance by avoiding lambda-expressions and using already existing functions. Instances of list and set defines a __contains__-method to use for containment checks. The in-operator calls this method under the hood, so using x in l2 can be replaced by l2.__contains__(x). Usually this replacement is not really prettier but in this specific case it allows us to gain better performance than using a lambda-expression, when used in combination with filterfalse:

>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = [2, 3, 5, 8]
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]

filterfalse creates an iterator yielding all elements that returns false when used as an argument for l2.__contains__.

Sets has a faster implementation of __contains__ so even better is:

>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = set([2, 3, 5, 8])
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]

Performance

Using list:

$  python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filterfalse(l2.__contains__, l1))"
500000 loops, best of 5: 522 nsec per loop

Using set:

$ python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filterfalse(l2.__contains__, l1))"
1000000 loops, best of 5: 359 nsec per loop
2

The set approach is the best if you WANT THAT behavior. If you do not want to remove all instances of elements in list l1 that exist only once in l2, those set operations will lead to incorrect results. Suppose you have repeating elements in l1 , and probably even in l2 and want an actual difference of the two lists l1 - l2, while maintaining the order of remaining elements:

l1 = [1, 2, 3, 4, 5, 5, 6, 5, 5, 2]
l2 = [1, 2, 2, 5]
_ = [l1.remove(item) for item in l2 if item in l1] # discard return value
print(l1) # [3, 4, 5, 6, 5, 5]
  1. Careful that this will be significantly slower than set operation use this only if your use case needs it
  2. If you dont want to modify the original list - create a copy of the list first
Deepak Gaur
  • 128
  • 7
0

Sets versus list comprehension benchmark on Python 3.8

(adding up to Moinuddin Quadri's benchmarks)

tldr: Use Arkku's set solution, it's even faster than promised in comparison!

Checking existing files against a list

In my example I found it to be 40 times (!) faster to use Arkku's set solution than the pythonic list comprehension for a real world application of checking existing filenames against a list.

List comprehension:

%%time
import glob
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
[i for i in wanted if i not in existing]

Wall time: 28.2 s

Sets

%%time
import glob
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
set(wanted) - set(existing)

Wall time: 689 ms

do-me
  • 1,600
  • 1
  • 10
  • 16
0

Try this:

l1=[1,2,6,8]
l2=[2,3,5,8]
r=[]
for x in l1:
    if x in l2:
        continue
    r=r+[x]
print(r)
SX10
  • 30
  • 2
-1

Maintaining order by exploiting the ordered property of dicts (Python 3.7+)

Note: the reference implementation of dicts in Python 3.6 maintains keys in their order of insertion order, but this is not guaranteed by the specification. For 3.7 and up, this guarantee was added.

The keys of a dict function as a sort of set; duplicates are implicitly filtered out, and lookup is efficient due to hashing. Therefore, we can implement a "set difference" by building a dict using l1 as keys, and then removing any keys that appear in l2. This maintains order and uses a fast algorithm, but incurs a fair amount of constant-factor overhead.

d = dict.fromkeys(l1)
for i in l2:
    try:
        del d[i]
    except KeyError:
        pass
l3 = list(d.keys())
Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153