5

I was trying to write a function to remove duplicates from a list in Python.

But after I did this, I found the list was sorted by converting it into set and back to a list.

Here is the script:

>>> l = [9,10,10,11,1,1,1,22,2,2,2]
>>> s = set(l)
>>> s
set([1, 2, 9, 10, 11, 22])
>>> l2 = list(s)
>>> l2
[1, 2, 9, 10, 11, 22]
>>> l2 = list(set(l))
>>> l2
[1, 2, 9, 10, 11, 22]
>>> 

The set s is ordered (at least ordered when printing it).

Why the set is ordered?

And what is the time complexity if I remove duplicates by running this:

def remove_duplicates(nums):
    return list(set(nums))
Remi Guan
  • 21,506
  • 17
  • 64
  • 87
xhanshawn
  • 61
  • 1
  • 5
  • Briefly: `set` objects are arbitrarily ordered. – TigerhawkT3 Mar 27 '16 at 00:14
  • 2
    @TigerhawkT3 Please don't be so aggressive in closing. There is more to the question than just arbitrary set ordering. – Raymond Hettinger Mar 27 '16 at 00:17
  • `set_l = [x for (i,x) in enumerate(l) if l.index(x) == i]` – ml-moron Mar 27 '16 at 00:19
  • @RaymondHettinger - Please don't be so aggressive in reopening. Questions should have one question per question. – TigerhawkT3 Mar 27 '16 at 00:19
  • 2
    @TigerhawkT3 That isn't a StackOverflow question guideline. Many good questions about how something works have legitimate sub-questions raised by the example. – Raymond Hettinger Mar 27 '16 at 00:28
  • 1
    @TigerhawkT3 OP seems to know the expected behavior of `set`. The question is not about if `set` is ordered, but __why__ `set` seems to be ordered when printed. The question is fairly legitimate and intelligent. Please reread the question to make sure you understand the question before closing. – Patrick the Cat Mar 27 '16 at 02:32

1 Answers1

5

The running time for the list(set(data)) approach is O(n).

The set appears ordered as an artifact of how integers are hashed. With other inputs, the data will scramble away from sorted order.

To overcome arbitrary ordering, use this idiom which is also O(n): list(collections.OrderedDict.fromkeys(data))

Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
  • 1
    Gocha! You are right. And after I tested it with some very large numbers, the returned list was not ordered. Thank you very much. – xhanshawn Mar 27 '16 at 00:30
  • Your new approach is O(n log n) is it not? – Timothy Shields Mar 30 '16 at 04:01
  • @TimothyShields It is not. The number of comparisons is O(n). This easy to verify through instrumentation. – Raymond Hettinger Mar 30 '16 at 07:59
  • @RaymondHettinger Wouldn't that mean you have a deterministic comparison-based way to sort a list of n distinct numbers in O(n) time? That isn't possible. Your solution is O(n log n), unless I am mistaken about the implementation of OrderedDict. Your answer is good - I just want to clarify this point so people aren't being misinformed. – Timothy Shields Mar 30 '16 at 12:11
  • @TimothyShields You are in-fact profoundly misunderstanding what OrderedDict does and how it is implemented. It does not sort; instead, remembers insertion order. Sorry dude, but your posts are flat-out wrong. Please at least read the docs or try out an example at the interactive prompt before contradicting the author of OrderedDict. – Raymond Hettinger Mar 31 '16 at 03:27
  • 1
    @RaymondHettinger I see what you mean now. :). I had OrderedDict confused with something like a sorted RB tree, because at first glance that's what the name suggests to me. Your tone is uncalled for. I said I wanted to "clarify." I said "unless I am mistaken about the implementation." That means I was unsure about what I was saying and was inviting you to describe what I am misunderstanding in a friendly tone. Next time try: "That would be true if OrderedDict actually sorted, but it doesn't. It just remembers the order in which items were added and still uses a hash table." – Timothy Shields Mar 31 '16 at 03:41