update::: post contains a reference to false allegations of inferior performance of sets compared to frozensets. I maintain that it's still sensible to use a frozenset in this instance, even though there's no need to hash the set itself, just because it's more correct semantically. Though, in practice, I might not bother typing the extra 6 characters. I'm not feeling motivated to go through and edit the post, so just be advised that the "allegations" link links to some incorrectly run tests. The gory details are hashed out in the comments. :::update
The second chunk of code posted by Brandon Craig Rhodes is quite good, but as he didn't respond to my suggestion about using a frozenset (well, not when I started writing this, anyway), I'm going to go ahead and post it myself.
The whole basis of the undertaking at hand is to check if each of a series of values (L1
) are in another set of values; that set of values is the contents of L2
and L3
. The use of the word "set" in that sentence is telling: even though L2
and L3
are list
s, we don't really care about their list-like properties, like the order that their values are in or how many of each they contain. We just care about the set (there it is again) of values they collectively contain.
If that set of values is stored as a list, you have to go through the list elements one by one, checking each one. It's relatively time-consuming, and it's bad semantics: again, it's a "set" of values, not a list. So Python has these neat set types that hold a bunch of unique values, and can quickly tell you if some value is in them or not. This works in pretty much the same way that python's dict
types work when you're looking up a key.
The difference between sets and frozensets is that sets are mutable, meaning that they can be modified after creation. Documentation on both types is here.
Since the set we need to create, the union of the values stored in L2
and L3
, is not going to be modified once created, it's semantically appropriate to use an immutable data type. This also allegedly has some performance benefits. Well, it makes sense that it would have some advantage; otherwise, why would Python have frozenset
as a builtin?
update...
Brandon has answered this question: the real advantage of frozen sets is that their immutability makes it possible for them to be hashable, allowing them to be dictionary keys or members of other sets.
I ran some informal timing tests comparing the speed for creation of and lookup on relatively large (3000-element) frozen and mutable sets; there wasn't much difference. This conflicts with the above link, but supports what Brandon says about them being identical but for the aspect of mutability.
...update
Now, because frozensets are immutable, they don't have an update method. Brandon used the set.update
method to avoid creating and then discarding a temporary list en route to set creation; I'm going to take a different approach.
items = (item for lst in (L2, L3) for item in lst)
This generator expression makes items
an iterator over, consecutively, the contents of L2
and L3
. Not only that, but it does it without creating a whole list-full of intermediate objects. Using nested for
expressions in generators is a bit confusing, but I manage to keep it sorted out by remembering that they nest in the same order that they would if you wrote actual for loops, e.g.
def get_items(lists):
for lst in lists:
for item in lst:
yield item
That generator function is equivalent to the generator expression that we assigned to items
. Well, except that it's a parametrized function definition instead of a direct assignment to a variable.
Anyway, enough digression. The big deal with generators is that they don't actually do anything. Well, at least not right away: they just set up work to be done later, when the generator expression is iterated. This is formally referred to as being lazy. We're going to do that (well, I am, anyway) by passing items
to the frozenset
function, which iterates over it and returns a frosty cold frozenset.
unwanted = frozenset(items)
You could actually combine the last two lines, by putting the generator expression right inside the call to frozenset
:
unwanted = frozenset(item for lst in (L2, L3) for item in lst)
This neat syntactical trick works as long as the iterator created by the generator expression is the only parameter to the function you're calling. Otherwise you have to write it in its usual separate set of parentheses, just like you were passing a tuple as an argument to the function.
Now we can build a new list in the same way that Brandon did, with a list comprehension. These use the same syntax as generator expressions, and do basically the same thing, except that they are eager instead of lazy (again, these are actual technical terms), so they get right to work iterating over the items and creating a list from them.
L4 = [item for item in L1 if item not in unwanted]
This is equivalent to passing a generator expression to list
, e.g.
L4 = list(item for item in L1 if item not in unwanted)
but more idiomatic.
So this will create the list L4
, containing the elements of L1
which weren't in either L2
or L3
, maintaining the order that they were originally in and the number of them that there were.
If you just want to know which values are in L1
but not in L2
or L3
, it's much easier: you just create that set:
L1_unique_values = set(L1) - unwanted
You can make a list out of it, as does st0le, but that might not really be what you want. If you really do want the set of values that are only found in L1
, you might have a very good reason to keep that set as a set
, or indeed a frozenset
:
L1_unique_values = frozenset(L1) - unwanted
...Annnnd, now for something completely different:
from itertools import ifilterfalse, chain
L4 = list(ifilterfalse(frozenset(chain(L2, L3)).__contains__, L1))