0

Is there an operator to remove elements from a List based on the content of a Set?

What I want to do is already possible by doing this:

words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}

new_list = [w for w in words if w not in my_set]
# ["hello", "how", "today", "hello"]

What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets. Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.

So is there some way of computing a difference between a List and a Set in a shorter/cleaner/more efficient way than using a list comprehension, like for example:

# I know this is not possible, but does something approaching exist?
new_list = words - my_set

TL;DR

I'm looking for a way to remove all element presents in a Set from a List, that is either:

  • cleaner (with a built-in perhaps)
  • and/or more efficient

than what I know can be done with list comprehensions.

Jivan
  • 21,522
  • 15
  • 80
  • 131
  • 5
    I think the code you're using is pythonic enough?! – linusg Mar 26 '16 at 19:11
  • 6
    I feel like the only valid answer would be a simple “no”, which leaves nothing else to say, so I’m not posting it as an answer. Not sure what else to make of this question. – poke Mar 26 '16 at 19:11
  • 1
    You could overload the sub operator, check http://stackoverflow.com/a/3428637/1477064 – xvan Mar 26 '16 at 19:14
  • You *could* use `filter(lambda w: w not in set, words)` instead of list-comprehension – OneCricketeer Mar 26 '16 at 19:17
  • 7
    @cricket_007 `filter` in Python 3 returns an iterable, not a list. And this doesn’t give you any benefit over the list comprehension that OP already has. – poke Mar 26 '16 at 19:18
  • 1
    I was writing an answer covering some reasoning on why there isn’t a better way while this answer was getting closed. I’ll post it, if the question gets reopened. – poke Mar 26 '16 at 20:01

3 Answers3

4

Unfortunately, the only answer for this is: No, there is no built-in way, implemented in native code, for this kind of operation.

What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets.

I think what’s important here is the “looks” part. Yes, list comprehensions run more within Python than a set difference, but I assume that most of your application actually runs within Python (otherwise you should probably be programming in C instead). So you should consider whether it really matters much. Iterating over a list is fast in Python, and a membership test on a set is also super fast (constant time, and implemented in native code). And if you look at list comprehensions, they are also very fast. So it probably won’t matter much.

Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.

It is true that native operations are faster, but they are also more specialized, limited and allow for less flexibility. For sets, a difference is pretty easy. The set difference is a mathematical concept and is very clearly defined.

But when talking about a “list difference” or a “list and set difference” (or more generalized “list and iterable difference”?) it becomes a lot more unclear. There are a lot open questions:

  • How are duplicates handled? If there are two X in the original list and only one X in the subtrahend, should both X disappear from the list? Should only one disappear? If so, which one, and why?

  • How is order handled? Should the order be kept as in the original list? Does the order of the elements in the subtrahend have any impact?

  • What if we want to subtract members based on some other condition than equality? For sets, it’s clear that they always work on the equality (and hash value) of the members. Lists don’t, so lists are by design a lot more flexible. With list comprehensions, we can easily have any kind of condition to remove elements from a list; with a “list difference” we would be restricted to equality, and that might actually be a rare situation if you think about it.

    It’s maybe more likely to use a set if you need to calculate differences (or even some ordered set). And for filtering lists, it might also be a rare case that you want to end up with a filtered list, so it might be more common to use a generator expression (or the Python 3 filter() function) and work with that later without having to create that filtered list in memory.

What I’m trying to say is that the use case for a list difference is not as clear as a set difference. And if there was a use case, it might be a very rare use case. And in general, I don’t think it’s worth to add complexity to the Python implementation for this. Especially when the in-Python alternative, e.g. a list comprehension, is as fast as it already is.

Community
  • 1
  • 1
poke
  • 369,085
  • 72
  • 557
  • 602
  • Thanks for the comprehensive and thoughtful answer – Jivan Mar 26 '16 at 22:45
  • I accepted @Dunes' post because it positively answers the question "is there a more efficient way to...", but I hesitated and your answer is as valid as his. – Jivan Mar 26 '16 at 22:54
1

First things first, are you prematurely worrying about an optimisation problem that isn't really an issue? I have to to have lists with at least 10,000,000 elements before I even get into the range of this operation taking 1/10ths of a second.

If you're working with large data sets then you may find it advantageous to move to using numpy.

import random
import timeit

r = range(10000000)

setup = """
import numpy as np
l = list({!r})
s = set(l)
to_remove = {!r}
n = np.array(l)
n_remove = np.array(list(to_remove))
""".format(r, set(random.sample(r, 3)))

list_filter = "[x for x in l if x not in to_remove]"
set_filter = "s - to_remove"
np_filter = "n[np.in1d(n, n_remove, invert=True)]"

n = 1
l_time = timeit.timeit(list_filter, setup, number=n)
print("lists:", l_time)
s_time = timeit.timeit(set_filter, setup, number=n)
print("sets:", s_time)
n_time = timeit.timeit(np_filter, setup, number=n)
print("numpy:", n_time)

returns the following results -- with numpy an order of magnitude faster than using sets.

lists: 0.8743789765043315
sets: 0.20703006886620656
numpy: 0.06197169088128707
Dunes
  • 37,291
  • 7
  • 81
  • 97
  • Awesome. It shows that `[list comp]` and `set() - set()` are no that much different from one another, and also that numpy blows them all by a huge margin. Also, I appreciate suggesting an alternative rather than just saying "it's not worth it", even if @poke's answer is very useful as well. Thanks a lot. – Jivan Mar 26 '16 at 22:49
1

I agree with poke. Here is my reasoning:

The easiest way to do it would be using a filter:

words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}

new_list = filter(lambda w: w not in my_set, words)

And using Dunes solution, I get these times:

lists:  0.87401028
sets:   0.55103887
numpy:  0.16134396
filter: 0.00000886 WOW beats numpy by various orders of magnitude !!!

But wait, we are making a flawed comparison because we are comparing the time of making a list strictly (comprehension and set difference) vs. lazily (numpy and filter).

If I run Dunes solution but producing the actual lists, I get:

lists:  0.86804159
sets:   0.56945663  
numpy:  1.19315723
filter: 1.68792561

Now numpy is slightly more efficient than using a simple filter, but both are not better than the list comprehension, which was the first and more intuitive solution.

I would definitely use a filter over the comprehension, except if I need to use the filtered list more than once (although I could tee it).

chapelo
  • 2,519
  • 13
  • 19
  • Thanks. Why would you use a `filter` over the comprehension? – Jivan Mar 27 '16 at 12:42
  • The comprehension produces the actual list with all its elements, the filter is a generator that gives you one result at a time. If all you need to do is iterate over the elements in order, use the filter, but if you want to know the length, or extract elements by index or by matching them, you may need to have the actual list. – chapelo Mar 27 '16 at 12:49