You can store mutable objects in hash tables, but you probably shouldn't: if you hold a pointer to an object, store the object in the hash table, mutate the object, and then try to look it up via that pointer, you (almost certainly) won't find it, because its hash has changed. However, if your specific use case is removing duplicates like this, a hash is still the most efficient method. If you're not using multithreading and this duplicate removal is self-contained, you can safely get away with using a hash table (either a dict like you have, or a set), because the data is effectively immutable for the duration. The only downside in that case is adding a __hash__ method that isn't generally useful. One option is to make a thin wrapper class that implements __hash__, wrap your data in it to remove the duplicates, and then unwrap it, something like this:
class Wrapper(object):
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __eq__(self, other):
        if not isinstance(other, Wrapper):
            return False
        return self.wrapped == other.wrapped

    def __hash__(self):
        # custom logic here, e.g.
        return 31 * sum(hash(item) for item in self.wrapped)

def without_duplicate_lists(lists):
    wrappers = (Wrapper(l) for l in lists)
    result = []
    seen = set()
    for w in wrappers:
        if w not in seen:
            seen.add(w)
            result.append(w)
    # unwrap the wrappers we kept, in first-seen order
    return [w.wrapped for w in result]
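For instance, with some made-up sample data (not your actual data), usage would look like:

data = [[1, 2], [3, 4], [1, 2]]
print(without_duplicate_lists(data))  # [[1, 2], [3, 4]]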
There are other ways you could do this without computing a hash, such as using a binary search tree (a tree map). However, the hash isn't the inherent problem with storing mutable data in maps; the problem is that if you change the key in any way, you've lost your value. With hash tables, the key is the hash. With a binary search tree, it's the ordering, which is defined by the key itself. If the key itself changes, you can't look it up, whatever method you use.
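To make that failure mode concrete, here is a minimal sketch; the Point class and its fields are purely illustrative, not something from your code:

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # the hash is derived from mutable state -- this is the trap
        return hash((self.x, self.y))

p = Point(1, 2)
table = {p: "value"}
p.x = 99            # the key has been mutated, so its hash has changed
print(p in table)   # False: the lookup probes using the new hash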
Also, note that in my example calculating the hash is an O(n) operation, so if you are removing duplicates of lists or sets or dicts, it may just be cleaner to convert them (or their keys) to tuples or frozensets, which are immutable and can be used in the set without worry (the same even applies to other objects).
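As a sketch of that approach for plain lists (the function name here is mine, just for illustration):

def without_duplicate_lists_via_tuples(lists):
    seen = set()
    result = []
    for l in lists:
        key = tuple(l)          # immutable, hashable stand-in for the list
        if key not in seen:
            seen.add(key)
            result.append(l)
    return result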
If for some reason there is a middle ground where you think removing duplicates of mutable data via a hash is a bad idea but removing duplicates of mutable data is still fine, probably the next most efficient option is to sort the data. This way, instead of O(n^2) for the naive solution of nested iteration, you can sort in O(n*log(n)) time and then iterate through, only keeping items where the current value doesn't equal the last value. This requires you to implement __eq__, __gt__, and __lt__ (or __eq__ plus one of the latter two and use @total_ordering):
def without_duplicates_sorted(objs):
    objs = sorted(objs)
    if not objs:
        return []
    last = objs[0]
    result = [last]
    for current in objs[1:]:
        if current != last:
            result.append(current)
            last = current
    return result
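If your objects don't define an ordering yet, a rough sketch with functools.total_ordering could look like this (the Record class is purely illustrative):

from functools import total_ordering

@total_ordering
class Record(object):
    def __init__(self, values):
        self.values = values

    def __eq__(self, other):
        return isinstance(other, Record) and self.values == other.values

    def __lt__(self, other):
        # total_ordering derives __le__, __gt__, and __ge__ from __eq__ and __lt__
        return self.values < other.values

records = [Record([2, 3]), Record([1]), Record([2, 3])]
deduped = without_duplicates_sorted(records)  # one Record([1]) and one Record([2, 3]), in sorted order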
For completeness’ sake, the naive solution:
def without_duplicates_naive(objs):
    result = []
    for obj in objs:
        if obj not in result:
            result.append(obj)
    return result
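For example, with some made-up data (note that the sorted variant returns results in sorted order rather than input order):

data = [[3, 4], [1, 2], [3, 4], [5]]
without_duplicates_naive(data)    # [[3, 4], [1, 2], [5]]
without_duplicates_sorted(data)   # [[1, 2], [3, 4], [5]]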