7

I have a list of 5 million string elements, which are stored as a pickle object.

a = ['https://en.wikipedia.org/wiki/Data_structure','https://en.wikipedia.org/wiki/Data_mining','https://en.wikipedia.org/wiki/Statistical_learning_theory','https://en.wikipedia.org/wiki/Machine_learning','https://en.wikipedia.org/wiki/Computer_science','https://en.wikipedia.org/wiki/Information_theory','https://en.wikipedia.org/wiki/Statistics','https://en.wikipedia.org/wiki/Mathematics','https://en.wikipedia.org/wiki/Signal_processing','https://en.wikipedia.org/wiki/Sorting_algorithm','https://en.wikipedia.org/wiki/Data_structure','https://en.wikipedia.org/wiki/Quicksort','https://en.wikipedia.org/wiki/Merge_sort','https://en.wikipedia.org/wiki/Heapsort','https://en.wikipedia.org/wiki/Insertion_sort','https://en.wikipedia.org/wiki/Introsort','https://en.wikipedia.org/wiki/Selection_sort','https://en.wikipedia.org/wiki/Timsort','https://en.wikipedia.org/wiki/Cubesort','https://en.wikipedia.org/wiki/Shellsort']

To remove duplicates, I use set(a), then I made it a list again through list(set(a)).

My question is:

Even if I restart python, and read the list from the pickle file, will the order of list(set(a)) be the same every time?

I'm eager to know how this hash -> list ordering works.


I tested with a small dataset and it seems to have a consistent ordering.

In [50]: a = ['x','y','z','k']

In [51]: a
['x', 'y', 'z', 'k']

In [52]: list(set(a))
['y', 'x', 'k', 'z']

In [53]: b=list(set(a))

In [54]: list(set(b))
['y', 'x', 'k', 'z']

In [55]: del b

In [56]: b=list(set(a))

In [57]: b
['y', 'x', 'k', 'z']
aerin
  • 20,607
  • 28
  • 102
  • 140
  • There is definitely a random element involved in the hashing procedure. – Willem Van Onsem May 04 '16 at 20:59
  • For starters, the order of the hash isn't guaranteed, so the order of the list wouldn't be guaranteed either. – Makoto May 04 '16 at 20:59
  • I guess you can use [ordered-set](https://pypi.python.org/pypi/ordered-set) instead of `set` – MaxU - stand with Ukraine May 04 '16 at 21:02
  • 2
    @TigerhawkT3: I disagree with the dupe. The dupe talks about dictionaries; this is asking about the order of a `list(set(l))`. – Makoto May 04 '16 at 21:03
  • 2
    Dictionaries are sets with attached values. The hashing is the same. – TigerhawkT3 May 04 '16 at 21:05
  • @TigerhawkT3 this is question about hash-to-list ordering, do you still think it's the same? – aerin May 04 '16 at 21:07
  • 2
    @TigerhawkT3 I first marked this question as a duplicate but I just realized that it's slightly different with the dup and other similar questions as well. The question's tops says that clearly. – Mazdak May 04 '16 at 21:15
  • Also see [this comment](http://stackoverflow.com/questions/5629023/key-order-in-python-dictionaries#comment48032902_5629023) and [this question](http://stackoverflow.com/questions/24482872/i-know-python-dict-is-order-less-but-why-i-am-getting-keys-in-same-order-always) and the implementation detail in [this documentation](https://docs.python.org/2/library/stdtypes.html#dict.items). – TigerhawkT3 May 04 '16 at 21:15
  • @DDDrupal Please read http://stackoverflow.com/questions/30537090/whats-the-logic-behind-pythons-hash-function-order if you want to understand the logic, but about changing the order over the time, if you don't modify the list the order would be same over the time. – Mazdak May 04 '16 at 21:17
  • As the question I linked above says, in the documentation the answer points to, it depends on the "history of insertions and deletions." I'd say that means that if you build it in the exact same way you'll get the exact same iteration order. Note that it also says that it's implementation-dependent, so your program shouldn't rely on this ordering. – TigerhawkT3 May 04 '16 at 21:21
  • 2
    @TigerhawkT3 I'd argue that even if the answer is the same as for dicts, the explanation that sets, and list(set(l)) will do the same thing is enough to make this not a dupe. – mbrig May 04 '16 at 21:36
  • 1
    This question Y may be about sets while the other question Y is about dictionaries, but question X is about hash tables (see the [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)). Ignoring a useful resource because of a trivial difference like that would be a disservice. – TigerhawkT3 May 04 '16 at 21:41
  • 2
    @TigerhawkT3 That's definitely true, but if I'm wondering what order my list(set(l)) will be in, an answer explaining that a set is a hash table, and linking to the explanation of dictionary/hash table ordering (maybe with a summary, maybe not), is substantially more useful than a link to a question about dicts that I may not understand as being related. There may already be a dupe target that does the same thing. – mbrig May 04 '16 at 21:46
  • Some of these comments are about the semantics of the question not the question. please delete? – Merlin Jun 10 '16 at 21:49

1 Answers1

2

I would suggest an auxiliary set() to ensure unicity when adding items on the list, thus preserving the order of your list(), and not storing the set() per se.

First, load your list and create a set with the contents Before adding items to your list, check that they are not in the set (much faster search using "in" from the set rather than the list, specially if there are many elements) Pickle your list, the order will be exactly the one you want

Drawback: takes twice as much memory than handling only a set()

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219