
I have a list of elements which should be unique by construction; that is, no element will be present more than once in the list.

I want to efficiently test whether an item is present in that list, and to do so for many items.

The test is more efficient if I convert the list into a set.

Now my questions are about how to efficiently build the set.

I guess that when I do `my_set = set(my_list)`, Python somehow has to test the membership of each item as it progressively constructs the set.

  1. Is this sub-optimal given the knowledge that I know that the list does not contain duplicates?

  2. Is it possible to do better?

  3. Does the answer to the question above change if instead of a list, I have an iterator (about which I still know that the items it will yield will be unique)?
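For concreteness, here is a minimal sketch of the setup being described (all names are made up):

```python
# A list that is unique by construction (hypothetical data).
unique_items = ["item-%d" % i for i in range(100000)]

# One-time conversion; subsequent lookups are O(1) on average.
lookup = set(unique_items)

print("item-42" in lookup)   # True
print("missing" in lookup)   # False
```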

rezca
bli
  • **No**, that's not what `set(my_list)` does. It simply hashes each object and adds it if it's not already present in the set, i.e. no membership check is done against the list. – Ashwini Chaudhary Mar 26 '15 at 18:55
  • Why do you think this would be slow? `st = set(range(10000000))` takes at most half a second on my computer. That's ten million items. And I only had to do that step once. So I'm pretty sure this is a case of premature optimization. – Steven Rumbalski Mar 26 '15 at 19:02

2 Answers


Python doesn't do explicit membership tests when constructing a set. It doesn't need to: sets are unique by nature, because their members are indexed by their hash values. So in constructing a set, all Python does is hash each element in turn and insert it in the appropriate slot.

The Python docs on time complexity don't explicitly list set construction, but they do say that most set operations are the same as for a dict, and insertion into a dict is O(1) on average, from which we can conclude that set construction is O(n).
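One way to check this claim empirically (a quick sketch, not from the original answer) is to wrap elements in a class that counts how often CPython hashes and compares them during set construction:

```python
class Counted:
    """Wrapper that counts calls to __hash__ and __eq__."""
    hash_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        Counted.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        Counted.eq_calls += 1
        return self.value == other.value

items = [Counted(i) for i in range(1000)]
s = set(items)

print(Counted.hash_calls)  # 1000: each element is hashed exactly once
print(Counted.eq_calls)    # equality is only checked on hash collisions
```

With 1000 distinct small-integer hashes there are no collisions here, so the equality counter stays at 0: no membership scan against the rest of the data ever happens.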

Daniel Roseman
  • 1
    I believe that is incorrect. Let's say I create this class `class BadHash(object): def __hash__(self):return 1`. Then I create a set `set([BadHash(), BadHash(), BadHash()])`. The resulting set has three Items. So items whose hash collides need to be checked for equality. – Steven Rumbalski Mar 26 '15 at 18:59
  • 1
    Yes, it will also do an equality check in case of a hash collision. That's still a single operation, and is not the same as a uniqueness check across the whole set for every item. – Daniel Roseman Mar 26 '15 at 19:02
  • Fair enough. A tiny membership test on a hash collision sure is a lot different than a membership across the whole set for each item. – Steven Rumbalski Mar 26 '15 at 19:04
  • @StevenRumbalski correct me if I'm wrong, but shouldn't you also be defining `__cmp__`, e.g. `def __cmp__(self, other): if self.__hash__() == other.__hash__(): return 0`? – MrAlexBailey Mar 26 '15 at 19:10
  • @Jkdc Yes, we should. It's recommended to define either `__eq__` or `__cmp__` whenever we define our own `__hash__` method. – Ashwini Chaudhary Mar 26 '15 at 19:13
  • @Jkdc: For my toy example not defining those suffices. – Steven Rumbalski Mar 26 '15 at 20:35
  • @StevenRumbalski: The items in my set are character strings with a prefix and a suffix in common. Would it make sense to convert them to a custom type with a cleverly defined `__hash__` function ? – bli Mar 27 '15 at 10:20
  • @bli: Are you kidding me? No! – Steven Rumbalski Mar 27 '15 at 12:51
  • 1
    @StevenRumbalski: I thought that maybe there could be ways to reduce the risks of hash collision or to compute the hash more quickly knowing that just part of the string may actually vary, and that this would in the end compensate the cost of type conversion. I'm not kidding, I just have practically no knowledge of these things, so my question was based on uninformed speculations. – bli Mar 29 '15 at 12:04
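The `BadHash` collision scenario discussed in these comments can be reproduced directly (a runnable sketch of the class quoted above):

```python
class BadHash(object):
    """Every instance hashes to the same value, forcing collisions."""
    def __hash__(self):
        return 1
    # No __eq__ is defined, so the default identity comparison applies:
    # distinct instances compare unequal even though their hashes collide.

s = set([BadHash(), BadHash(), BadHash()])
print(len(s))  # 3: on collision the set falls back to equality checks
```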

Since `set()` uses a hash table (see How is set() implemented?), more time would be spent hashing than comparing, and this is unavoidable.

I assume your dataset is very large if you're this concerned about performance. The only way you're going to get better performance is by creating the set() in the first place and avoiding the intermediate memory usage of the list().

$ python3 -m timeit 'set(list(range(100000)))'
100 loops, best of 3: 8.69 msec per loop

$ python3 -m timeit 'set(range(100000))'
100 loops, best of 3: 7.67 msec per loop

$ python3 -m timeit 'frozenset(range(100000))'
100 loops, best of 3: 7.68 msec per loop
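Regarding question 3: an iterator works just as well, since `set()` simply consumes whatever iterable it is given in a single pass, never materializing an intermediate list. A small sketch (the generator name is made up):

```python
def unique_ids(n):
    """Hypothetical generator yielding items that are unique by construction."""
    for i in range(n):
        yield "id-%d" % i

# set() consumes the generator lazily; no intermediate list exists.
s = set(unique_ids(100000))
print(len(s))  # 100000
```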
rezca