4

I have one or more unordered sequences of (immutable, hashable) objects with possible duplicates and I want to get a sorted sequence of all those objects without duplicates.

Right now I'm using a set to quickly gather all the elements discarding duplicates, convert it to a list and then sort that:

result = set()
for s in sequences:
    result = result.union(s)
result = list(result)
result.sort()
return result

It works but I wouldn't call it "pretty". Is there a better way?

Luke404
  • 10,282
  • 3
  • 25
  • 31
  • Related: [How do you remove duplicates from a list in Python whilst preserving order?](http://stackoverflow.com/q/480214) (but without sorting). – Martijn Pieters Jan 23 '13 at 22:07

2 Answers2

13

This should work:

sorted(set(itertools.chain.from_iterable(sequences)))
JBernardo
  • 32,262
  • 10
  • 90
  • 115
  • 1
    Commenting on your answer instead of mine (I retracted it), you were right. `set()`'s are overloaded with `|` for union, not `+`, got me there. This is the correct answer instead - +1. – orlp Sep 18 '11 at 00:35
  • Same time complexity: `sorted(set().union(*sequences))`. Python 2.6+, as is `chain.from_iterable`. – agf Sep 18 '11 at 03:21
  • I noticed it when I saw `sorted(reduce(set().union, sequences))` worked as well as `sorted(reduce(set.union, sequences, set()))` -- I thought I'd get an error. – agf Sep 18 '11 at 03:48
2

I like your code just fine. It is straightforward and easy to understand.

We can shorten it just a little bit by chaining off the list():

result = set()
for s in sequences:
    result = result.union(s)
return sorted(result)

I really have no desire to try to boil it down beyond that, but you could do it with reduce():

result = reduce(lambda s, x: s.union(x), sequences, set())
return sorted(result)

Personally, I think this is harder to understand than the above, but people steeped in functional programming might prefer it.

EDIT: @agf is much better at this reduce() stuff than I am. From the comments below:

return sorted(reduce(set().union, sequences))

I had no idea this would work. If I correctly understand how this works, we are giving reduce() a callable which is really a method function on one instance of a set() (call it x for the sake of discussion, but note that I am not saying that Python will bind the name x with this object). Then reduce() will feed this function the first two iterables from sequences, returning x, the instance whose method function we are using. Then reduce() will repeatedly call the .union() method and ask it to take the union of x and the next iterable from sequences. Since the .union() method is likely smart enough to notice that it is being asked to take the union with its own instance and not bother to do any work, it should be just as fast to call x.union(x, some_iterable) as to just call x.union(some_iterable). Finally, reduce() will return x, and we have the set we want.

This is a bit tricky for my personal taste. I had to think this through to understand it, while the itertools.chain() solution made sense to me right away.

EDIT: @agf made it less tricky:

return sorted(reduce(set.union, sequences, set()))

What this is doing is much simpler to understand! If we call the instance returned by set() by the name of x again (and just like above with the understanding that I am not claiming that Python will bind the name x with this instance); and if we use the name n to refer to each "next" value from sequences; then reduce() will be repeatedly calling set.union(x, n). And of course this is exactly the same thing as x.union(n). IMHO if you want a reduce() solution, this is the best one.

--

If you want it to be fast, ask yourself: is there any way we can apply itertools to this? There is a pretty good way:

from itertools import chain
return sorted(set(chain(*sequences)))

itertools.chain() called with *sequences serves to "flatten" the list of lists into a single iterable. It's a little bit tricky, but only a little bit, and it's a common idiom.

EDIT: As @Jbernardo wrote in the most popular answer, and as @agf observes in comments, itertools.chain() returns an object that has a .from_iterable() method, and the documentation says it evaluates an iterable lazily. The * notation forces building a list, which may consume considerable memory if the iterable is a long sequence. In fact, you could have a never-ending generator, and with itertools.chain().from_iterable() you would be able to pull values from it for as long as you want to run your program, while the * notation would just run out of memory.

As @Jbernardo wrote:

sorted(set(itertools.chain.from_iterable(sequences)))

This is the best answer, and I already upvoted it.

steveha
  • 74,789
  • 21
  • 92
  • 117
  • `chain(*sequences)` is wrong. That's precisely what `chain.from_iterable` is for, as shown in the other answer. Also, if you're going to use `reduce`, don't use a lambda expression, use a function implemented in C -- in this case, `sorted(reduce(set().union, sequences))`. `reduce` or the loop will also be significantly slower than the `chain` method for sequences of significant size, because of the [time complexity of set union operations](http://wiki.python.org/moin/TimeComplexity) and having to do it for each sequence, rather than just doing all the work in one pass in the set constructor. – agf Sep 18 '11 at 02:59
  • Actually, there is a way without `reduce` or `chain` for the `union` method -- `sorted(set().union(*sequences))` works, as `union` can take any number of iterables as of 2.6. It looks to have the same time complexity as the `chain` method. – agf Sep 18 '11 at 03:19
  • Actually, I agree that I should have used `.from_iterable()` since it is available. But would you please explain to me how my code is "wrong"? It works, and it is a reasonably common idiom. Do you consider it "wrong" because, in the case where its argument is an iterator or generator, it must build a list? – steveha Sep 18 '11 at 05:26
  • It's wrong because we have a special tool for that purpose, I didn't mean it gave the wrong answer. `itertools` is designed to avoid having to build intermediate lists, as you said. I really meant something more like "the wrong way" than "wrong". – agf Sep 18 '11 at 05:48
  • The `reduce` version that works more like your original one but without the lambda is `reduce(set.union, sequences, set())` -- it doesn't rely on `union` taking multiple sequences to update from. – agf Sep 18 '11 at 06:05
  • It's probably a few microseconds faster, since it is doing slightly less work, and I like it better because it is less tricky. – steveha Sep 18 '11 at 06:11
  • JBernardo's answer is the best also IMHO and it's the one I choose, it's also the fastest (6.03sec +/- 1% on my system with some test lists). I tested some of the reduce/union answers getting 10.5~10.8 secs on average. But thanks for the great discussion and explanation in this answer! – Luke404 Sep 20 '11 at 22:49
  • Yes, I agree that JBernardo's answer is the best. Thanks for posting an interesting question. – steveha Sep 21 '11 at 06:56