# I have 3 lists:
L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]
# I want to create another that is L1 minus L2's members and L3's members, so:
L4 = (L1 - L2) - L3  # Of course this isn't going to work

I'm wondering, what is the "correct" way to do this? I can do it many different ways, but the Zen of Python says there should be one (and preferably only one) obvious way to do each thing. I've never known what that is here.

orokusaki
  • There is no one correct way of doing this until you decide if you do care or do not care about duplicates and ordering. Probably some sort of list comprehension or set work depending on what you care about. – istruble Oct 16 '10 at 05:40
  • Also, is it OK to assume that all the items in the lists will be hashable all of the time? If not, or sometimes not, that would be very significant. – martineau Oct 16 '10 at 12:03
  • Why don't you use sets to begin with? Then your "arithmetic" would work. – poke Oct 16 '10 at 15:41
  • Some related generic code for set intersection (when it's not certain that all items will be hashable) can be found [here](http://stackoverflow.com/questions/2197482/efficiently-knowing-if-intersection-of-two-list-is-empty-or-not-in-python/2215556#2215556) – tzot Oct 17 '10 at 00:05

6 Answers


Here are some tries:

L4 = [ n for n in L1 if (n not in L2) and (n not in L3) ]  # parens for clarity

tmpset = set( L2 + L3 )
L4 = [ n for n in L1 if n not in tmpset ]

Now that I have had a moment to think, I realize that the L2 + L3 thing creates a temporary list that immediately gets thrown away. So an even better way is:

tmpset = set(L2)
tmpset.update(L3)
L4 = [ n for n in L1 if n not in tmpset ]
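With the question's data, a quick check of the result (not part of the original answer):

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]

tmpset = set(L2)
tmpset.update(L3)
L4 = [ n for n in L1 if n not in tmpset ]
print(L4)  # [1, 3, 6]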

Update: I see some extravagant claims being thrown around about performance, and I want to assert that my solution was already as fast as possible. Creating intermediate results, whether they are intermediate lists or intermediate iterators that must then be called back into repeatedly, will always be slower than simply giving the set L2 and L3 to iterate over directly, as I have done here.

$ python -m timeit \
  -s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2)' \
  'ts = set(L2); ts.update(L3); L4 = [ n for n in L1 if n not in ts ]'
10000 loops, best of 3: 39.7 usec per loop

All other alternatives (that I can think of) will necessarily be slower than this. Doing the loops ourselves, for example, rather than letting the set() constructor do them, adds expense:

$ python -m timeit \
  -s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2)' \
  'unwanted = frozenset(item for lst in (L2, L3) for item in lst); L4 = [ n for n in L1 if n not in unwanted ]'
10000 loops, best of 3: 46.4 usec per loop

Using iterators, with all of the state-saving and callbacks they involve, will obviously be even more expensive:

$ python -m timeit \
  -s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2);from itertools import ifilterfalse, chain' \
  'L4 = list(ifilterfalse(frozenset(chain(L2, L3)).__contains__, L1))' 
10000 loops, best of 3: 47.1 usec per loop

So I believe that the answer I gave last night is still far and away (for values of "far and away" greater than around 5µsec, obviously) the best, unless the questioner will have duplicates in L1 and wants them removed once each for every time the duplicate appears in one of the other lists.

Brandon Rhodes
  • It might be possible to eke out some more performance by building a frozen set from a chain of two list iterators. – intuited Oct 16 '10 at 04:44
  • No, frozen sets have exactly the same speed as normal ones, but normally require more expense to create because you have to create an intermediate result or loop yourself if, as here, you are building them from several input iterables. – Brandon Rhodes Oct 16 '10 at 12:48

update::: post contains a reference to false allegations of inferior performance of sets compared to frozensets. I maintain that it's still sensible to use a frozenset in this instance, even though there's no need to hash the set itself, just because it's more correct semantically. Though, in practice, I might not bother typing the extra 6 characters. I'm not feeling motivated to go through and edit the post, so just be advised that the "allegations" link links to some incorrectly run tests. The gory details are hashed out in the comments. :::update

The second chunk of code posted by Brandon Craig Rhodes is quite good, but as he didn't respond to my suggestion about using a frozenset (well, not when I started writing this, anyway), I'm going to go ahead and post it myself.

The whole basis of the undertaking at hand is to check whether each of a series of values (L1) is in another set of values; that set of values is the contents of L2 and L3. The use of the word "set" in that sentence is telling: even though L2 and L3 are lists, we don't really care about their list-like properties, like the order their values are in or how many of each they contain. We just care about the set (there it is again) of values they collectively contain.

If that set of values is stored as a list, you have to go through the list elements one by one, checking each one. That's relatively time-consuming, and it's bad semantics: again, it's a "set" of values, not a list. So Python has these neat set types that hold a bunch of unique values, and can quickly tell you if some value is in them or not. This works in pretty much the same way that Python's dict types work when you're looking up a key.
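A quick illustration of the difference, as a minimal sketch (the variable names are made up for illustration):

values = [4, 7, 8, 5, 2, 9]
lookup = set(values)

print(7 in values)  # True, but found by scanning the list element by element
print(7 in lookup)  # True, found by a hash-table probe: O(1) on average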

The difference between sets and frozensets is that sets are mutable, meaning that they can be modified after creation. Documentation on both types is in the Python standard library reference.

Since the set we need to create, the union of the values stored in L2 and L3, is not going to be modified once created, it's semantically appropriate to use an immutable data type. This also allegedly has some performance benefits. Well, it makes sense that it would have some advantage; otherwise, why would Python have frozenset as a builtin?

update...

Brandon has answered this question: the real advantage of frozen sets is that their immutability makes it possible for them to be hashable, allowing them to be dictionary keys or members of other sets.
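To make the hashability point concrete, here is a minimal sketch (the names are illustrative, not from the answer):

groups = {frozenset([4, 7, 8]): "from L2"}               # frozenset as a dict key
families = {frozenset([4, 7, 8]), frozenset([5, 2, 9])}  # frozensets as set members

# A mutable set can do neither; {set([4, 7, 8]): "boom"} raises
# TypeError: unhashable type: 'set'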

I ran some informal timing tests comparing the speed for creation of and lookup on relatively large (3000-element) frozen and mutable sets; there wasn't much difference. This conflicts with the above link, but supports what Brandon says about them being identical but for the aspect of mutability.

...update
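A minimal sketch of that kind of informal test, assuming 3000-element sets (absolute numbers will vary by machine):

import timeit

setup = 's = set(range(3000)); fs = frozenset(range(3000))'
print(timeit.timeit('1500 in s', setup=setup))   # lookup on a mutable set
print(timeit.timeit('1500 in fs', setup=setup))  # lookup on a frozenset: about the same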

Now, because frozensets are immutable, they don't have an update method. Brandon used the set.update method to avoid creating and then discarding a temporary list en route to set creation; I'm going to take a different approach.

items = (item for lst in (L2, L3) for item in lst)

This generator expression makes items an iterator over, consecutively, the contents of L2 and L3. Not only that, but it does it without creating a whole list-full of intermediate objects. Using nested for expressions in generators is a bit confusing, but I manage to keep it sorted out by remembering that they nest in the same order that they would if you wrote actual for loops, e.g.

def get_items(lists):
    for lst in lists:
        for item in lst:
            yield item

That generator function is equivalent to the generator expression that we assigned to items. Well, except that it's a parametrized function definition instead of a direct assignment to a variable.

Anyway, enough digression. The big deal with generators is that they don't actually do anything. Well, at least not right away: they just set up work to be done later, when the generator expression is iterated. This is formally referred to as being lazy. We're going to do that (well, I am, anyway) by passing items to the frozenset function, which iterates over it and returns a frosty cold frozenset.

unwanted = frozenset(items)

You could actually combine the last two lines, by putting the generator expression right inside the call to frozenset:

unwanted = frozenset(item for lst in (L2, L3) for item in lst)

This neat syntactical trick works as long as the iterator created by the generator expression is the only argument to the function you're calling. Otherwise you have to wrap it in its own separate set of parentheses, just as if you were passing it alongside other arguments.
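For example, with sum, which accepts an optional second argument (a sketch, not from the original answer):

total = sum(n * n for n in range(5))        # fine: the genexp is the sole argument; gives 30
total = sum((n * n for n in range(5)), 10)  # second argument present, so the genexp needs its own parens; gives 40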

Now we can build a new list in the same way that Brandon did, with a list comprehension. These use the same syntax as generator expressions, and do basically the same thing, except that they are eager instead of lazy (again, these are actual technical terms), so they get right to work iterating over the items and creating a list from them.

L4 = [item for item in L1 if item not in unwanted]

This is equivalent to passing a generator expression to list, e.g.

L4 = list(item for item in L1 if item not in unwanted)

but more idiomatic.

So this will create the list L4, containing the elements of L1 which weren't in either L2 or L3, preserving both their original order and how many times each appears.
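Putting the pieces together with the question's data:

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]

unwanted = frozenset(item for lst in (L2, L3) for item in lst)
L4 = [item for item in L1 if item not in unwanted]
print(L4)  # [1, 3, 6]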


If you just want to know which values are in L1 but not in L2 or L3, it's much easier: you just create that set:

L1_unique_values = set(L1) - unwanted

You can make a list out of it, as does st0le, but that might not really be what you want. If you really do want the set of values that are only found in L1, you might have a very good reason to keep that set as a set, or indeed a frozenset:

L1_unique_values = frozenset(L1) - unwanted

...Annnnd, now for something completely different:

from itertools import ifilterfalse, chain
L4 = list(ifilterfalse(frozenset(chain(L2, L3)).__contains__, L1))
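(A note for later readers: ifilterfalse exists only in Python 2's itertools; on Python 3 the equivalent one-liner uses filterfalse.)

from itertools import filterfalse, chain  # Python 3
L4 = list(filterfalse(frozenset(chain(L2, L3)).__contains__, L1))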
intuited
  • +1 Very informative. The most recent addition (with itertools) is very nice. I'd say you've earned your Ph.D in filtering lists based on inclusion in a set of lists. – aaronasterling Oct 16 '10 at 08:02
  • Am I missing something or is your generator expression just `itertools.chain`? If yes, just use that (you can keep the explanation of generators and generator expressions though, people need to learn about them). – delnan Oct 16 '10 at 12:21
  • No, frozen sets are *not* more efficient than normal sets, they merely have their "modify bit unset" so that you can use them as dictionary keys. They are, in other words, hashable, which is a really big deal in terms of what you can and cannot do with them in Python. (They can also be members of other sets that way.) It also means that you can pass them around to other routines without being afraid that they will come back modified, but hashability is a much bigger deal. But, as your answer illustrates, they are awkward to create. So I think your solution is overly complex with no "win". – Brandon Rhodes Oct 16 '10 at 12:45
  • @Brandon: Thanks for the info; I had forgotten about the hashability aspect. As you say, this is pretty "key". The [link](http://labs.kortina.net/2010/10/13/list-dict-set-and-frozen-set-performance-in-python/) I posted does show a significant speed advantage for frozen sets; any comment on that? – intuited Oct 16 '10 at 12:50
  • @delnan: it seems to be a bit of a contentious issue whether `itertools.chain` is actually preferable to nested genexps. I gather that it's faster, but is arguably less self-explanatory than `for .. in .. for .. in ..` From what I've gathered (mostly from posts/comments on this site), a significant portion of the Python userbase finds nested genexps easier to read and/or is reluctant to import from `itertools` just to get different syntax. – intuited Oct 16 '10 at 12:54
  • @intuited: Yes! My comment is that you are being misled by that post. If you will edit the sample code shown in the link, change the line `words_frozenset = frozenset(words_set)` so that it says `words_frozenset = set(words_set)` instead, and re-run the program, the second set *still* wins. The program is not showing that a frozenset is faster. It is showing that a set created directly from another set has a speed advantage, probably because the members get laid out in memory in a more efficient order than when the members are added in the order they happen to arrive from the input file. – Brandon Rhodes Oct 16 '10 at 12:57
  • @Brandon: I'm not sure if I read you correctly or not. I swapped the way that the test's `set` and `frozenset` are created, and the frozenset test is consistently running in about 5/6 of the time of the set test. I tried a few different approaches, e.g. creating them both directly from the original list; the ratio stayed pretty much the same. [This gist revision](http://gist.github.com/629781/04aadc5621755cdb36d89a26c6938d89fc1e8c8e) does a swap of the instantiation of those variables. When I run it, the frozenset is still faster. Maybe you can figure out where the discrepancy lies? – intuited Oct 16 '10 at 13:44
  • @intuited: I will investigate why they are not running at the same speed. My only point is that if you will get your code from gist and change the frozen set to a normal set, then the normal set will also win the race. You will get two sets that have different speeds. Try it! – Brandon Rhodes Oct 16 '10 at 14:14
  • @intuited: Having experimented with your branch a bit more, I note that it is always the second set-or-frozenset that gets tested that winds up winning by about 30%. My guess is that this author is doing something subtly wrong with their testing logic, which is generally what comes of trying to re-implement `timeit()`. – Brandon Rhodes Oct 16 '10 at 14:28
  • @intuited: Yes! I think I see it! The problem, I think, is that rather than putting each test in its own loop, he does them all in one loop — so they each see an L1 and L2 cache in the state the previous guy left it in, which gives some of them an unfair advantage! – Brandon Rhodes Oct 16 '10 at 14:36
  • @Brandon: looks like [that's it](https://gist.github.com/629781/9b34a593c9865622e2d6d693af55bcf4181a9782). I got rid of the list and dict tests, added 2 more sets of each of the set tests, and found that the times tend to even out in the later sets. Good eye. – intuited Oct 16 '10 at 15:09
  • @Brandon: Is what you said about frozen sets 'merely [having] their "modify bit unset"' documented somewhere, or did you learn that from reading the Python source, or from another source? – intuited Oct 16 '10 at 15:30
  • @intuited: I gave the "Mighty Dictionary" talk at PyCon, so I know how Python uses hash tables at the C level. With that knowledge, I didn't need to look at the set and frozenset source code; I already know how hash table lookup works, and I know that making the data structure read-only cannot make lookup any faster. My talk is here: http://blip.tv/file/3332763 – Brandon Rhodes Oct 16 '10 at 19:41
  • @Brandon: awesome, I'll check it out. Makes sense, now that I think about it, actually.. :) I guess it could theoretically choose the hashtable size in a way that would minimize collisions for that data set, but that would probably be a case of diminishing returns and escalating memory usage, and/or require significant investment at set instantiation. – intuited Oct 16 '10 at 23:47
  • @intuited: indeed, there would be little benefit: since hash tables grow automatically to stay ahead of the size of their contents, the only really big mismatch can happen if you empty out a big set or dictionary and leave its slots mostly empty. I've enjoyed this exchange, by the way; we had to dig way farther into this guy's issue than I ever expected. :-) Are you coming to PyCon Atlanta 2011? I'll buy you a beer if I see you there! – Brandon Rhodes Oct 16 '10 at 23:57

Assuming your individual lists won't contain duplicates, use sets and set difference:

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]
print(list(set(L1) - set(L2) - set(L3)))
st0le

Doing such operations on lists can hurt your program's performance quickly: each remove forces the list to shift its remaining elements down (and possibly reallocate), which gets expensive on a very large list. So I would suggest this:

I am assuming your list has unique elements; otherwise you would need to keep a list of the duplicate values in your dict. Anyway, for the data you provided, here it is:

METHOD 1

# Mark every element of L1 as wanted.
d = dict()
for x in L1:
    d[x] = True

# Check if L2 data is in 'd'
for x in L2:
    if x in d:
        d[x] = False

for x in L3:
    if x in d:
        d[x] = False

# Finally retrieve all keys with value as True.
final_list = [x for x in d if d[x]]
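A compact, self-contained check of METHOD 1 with the question's data (the dict comprehension here is shorthand, not part of the original; on CPython 3.7+ dicts preserve insertion order, so the result keeps L1's order):

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]

d = {x: True for x in L1}      # mark everything in L1 as wanted
for x in L2 + L3:
    if x in d:
        d[x] = False           # unmark anything that appears in L2 or L3

print([x for x in d if d[x]])  # [1, 3, 6]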

METHOD 2 If all that looks like too much code, then you could try using set. But this way your list will lose all duplicate elements (and its original order).

final_set = set.difference(set(L1), set(L2), set(L3))
final_list = list(final_set)
Srikar Appalaraju
  • The list comprehension doesn't do remove operations which are the expensive ones. – aaronasterling Oct 16 '10 at 04:38
  • @aaron yes I know. I was referring to the solution posted by Santiago. – Srikar Appalaraju Oct 16 '10 at 04:43
  • Hey, you're basically using a dictionary as a set. They have a whole other data type for that: http://docs.python.org/library/stdtypes.html#types-set – intuited Oct 16 '10 at 04:48
  • @intuited yes I realized that after I posted the solution :) . Hence I posted method 2 with the shorter `set` version. – Srikar Appalaraju Oct 16 '10 at 04:50
  • Ah, okay. I didn't refresh; it didn't notify me. Clearly a bug in stackoverflow :) Though see the discussion for http://stackoverflow.com/questions/3947654/python-removing-items-from-lists/3947659#3947659 – intuited Oct 16 '10 at 05:01

This may be less pythonesque than the list-comprehension answer, but has a simpler look to it:

l1 = [ ... ]
l2 = [ ... ]

diff = list(l1)  # this copies the list, so l1 is left untouched
for element in l2:
    diff.remove(element)  # note: raises ValueError if element is not in diff

The advantage here is that we preserve order of the list, and if there are duplicate elements, we remove only one for each time it appears in l2.
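To cover the question's three lists, and to skip values that are absent instead of raising ValueError, a guarded variant might look like this (the guard is an addition for robustness, not in the original snippet):

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]

diff = list(L1)  # copy L1
for element in L2 + L3:
    if element in diff:       # guard: remove() would raise ValueError otherwise
        diff.remove(element)

print(diff)  # [1, 3, 6]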

salezica
  • That is incredibly expensive and is, to the contrary, more complicated to look at than a simple comprehension. – aaronasterling Oct 16 '10 at 04:37
  • A taste issue, it seems. I like list comprehensions a lot, I actually tend to overuse them, but I don't think `n for n in L if n not in...` is easy on the eye. One way or another, it is, I'll admit, computationally expensive. – salezica Oct 16 '10 at 04:44

I think intuited's answer is way too long for such a simple problem, and the standard library already has a function to chain two lists as an iterator: itertools.chain.

The procedure is as follows:

  1. Use itertools.chain to chain L2 and L3 without creating a memory-consuming copy
  2. Create a set from that (in this case, a frozenset will do because we don't change it after creation)
  3. Use a list comprehension to keep the elements of L1 that are not in L2 or L3. As set/frozenset lookup (x in someset) is O(1) on average, this will be very fast.

And now the code:

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]

from itertools import chain
tmp = frozenset(chain(L2, L3))
L4 = [x for x in L1 if x not in tmp] # [1, 3, 6]

This should be one of the fastest, simplest and least memory-consuming solutions.

AndiDog
  • It is not fastest; check the tests in my post. Putting an iterator in between the set and the already-iterable lists just slows things down. – Brandon Rhodes Oct 16 '10 at 19:37
  • @Brandon Craig Rhodes: Ok, let's say "one of the fastest solutions". Thanks for posting your benchmark results. – AndiDog Oct 16 '10 at 20:32
  • Indeed, your solution is definitely one of the fastest, and certainly in the class of O(n + m) solutions that this problem deserves. I just wanted to make sure that programmers realize that iterators are not pixie dust that is somehow faster than looping over a container itself; every item returned by an iterator requires its scope to be re-activated and its code re-commenced, so their benefits do not come for free. – Brandon Rhodes Oct 16 '10 at 21:34