
It's simple to remove duplicates from a list of immutable objects by keeping a memo of the objects seen so far:

nums = [1, 1, 2, 3, 4, 4]
duplicates_removed = []
seen = dict()
for n in nums:
    if n not in seen:
        duplicates_removed.append(n)
        seen[n] = True

But mutable objects cannot be hashed into a dictionary like that. What is an elegant way to quickly remove duplicates (defined by some custom logic in the object class's __eq__) from a list?

  • When would you consider two mutable objects the "same"? Would two lists, both containing `[1, 2]`, be the same? Or would you only consider them the same if both are references to the exact same list? – Grismar Sep 12 '20 at 05:01

4 Answers


You can store mutable objects in hash tables; you just probably shouldn't, because if you store an object in a hash table, mutate it, and then try to look it up again, you (almost certainly) won't find it, since its hash has changed. However, if your specific use case is removing duplicates like this, the most efficient method will be a hash. If you're not using multithreading and the duplicate removal is self-contained, you can safely get away with a hash table (either a dict like you have, or a set), because for the duration of the operation the data is effectively immutable. The only risk in that case is adding a __hash__ method that isn't generally useful.

One option could be to make a thin wrapper class that implements __hash__, wrap your data in it to remove the duplicates, and then unwrap it, something like this:

class Wrapper(object):
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __eq__(self, other):
        if not isinstance(other, Wrapper):
            return False
        return self.wrapped == other.wrapped

    def __hash__(self):
        # custom logic here, e.g.
        return 31 * sum(hash(item) for item in self.wrapped)

def without_duplicate_lists(lists):
    wrappers = (Wrapper(l) for l in lists)
    result = []
    seen = set()
    for w in wrappers:
        if w not in seen:
            seen.add(w)
            result.append(w)
    return [w.wrapped for w in result]  # unwrap the survivors
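
For instance, a quick sanity check of the sketch above (the sample data is mine):

lists = [[1, 2], [1, 2], [3, 4]]
print(without_duplicate_lists(lists))  # [[1, 2], [3, 4]]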

There are some other ways you could do this without computing a hash, such as using a binary search tree (a tree map). However, the hash isn't the inherent problem with storing mutable data in maps; the problem is that if you change the key in any way, you've lost your value. With hash tables, the key is the hash. With a binary search tree, it's the ordering, which is defined by the key itself. Either way, if the key changes, you can't look the value up.
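
To make that failure mode concrete, here is a minimal sketch reusing the Wrapper class above: once the wrapped data mutates, the hash changes and the set lookup misses.

w = Wrapper([1, 2])
seen = {w}            # stored under the hash of [1, 2]
w.wrapped.append(3)   # mutating the data changes the hash...
print(w in seen)      # ...so the lookup misses: False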

Also, note that in my example calculating the hash is an O(n) operation, so if you are removing duplicates of lists, sets, or dicts, it may just be cleaner to convert them (or their items) to tuples or frozensets, which are immutable and can go in the set without worry (the same applies to other objects).
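
For example, deduplicating a list of lists with tuples as stand-in keys (a sketch with made-up data):

rows = [[1, 2], [1, 2], [3, 4]]
seen = set()
result = []
for row in rows:
    key = tuple(row)        # immutable stand-in for the list
    if key not in seen:
        seen.add(key)
        result.append(row)  # keep the original mutable object
print(result)  # [[1, 2], [3, 4]]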

If there is a middle ground where you think removing duplicates of mutable data via a hash is a bad idea but removing duplicates of mutable data is still fine, probably the next most efficient option is to sort the data. That way, instead of the O(n^2) of the naive nested-iteration solution, you can sort in O(n*log(n)) time and then iterate through, only keeping items where the current value doesn't equal the last. This requires implementing __eq__ and __gt__/__lt__ (or just __eq__ and one of the latter with @total_ordering):

def without_duplicates_sorted(objs):
    if not objs:  # guard against IndexError on empty input
        return []
    objs = sorted(objs)
    last = objs[0]
    result = [last]
    for current in objs[1:]:
        if current != last:
            result.append(current)
        last = current
    return result
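
As a usage sketch, here is a hypothetical Point class (the name and fields are mine, not from the question) wired up with @total_ordering:

from functools import total_ordering

@total_ordering
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __lt__(self, other):
        return (self.x, self.y) < (other.x, other.y)

    def __repr__(self):
        return f"Point({self.x}, {self.y})"

points = [Point(1, 2), Point(0, 0), Point(1, 2)]
print(without_duplicates_sorted(points))  # [Point(0, 0), Point(1, 2)]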

For completeness’ sake, the naive solution:

def without_duplicates_naive(objs):
    result = []
    for i, obj in enumerate(objs):
        # keeps the last occurrence of each duplicate
        if obj not in objs[i + 1:]:
            result.append(obj)
    return result
– dantiston

I don't know if it's elegant, but you can often convert non-hashable items to a hashable object like a frozenset or tuple. For example, you can convert a dict's items() like this:

nums = [{'a':1}, {'a':1}, {'a':2, 'c':3}, {'a':3}, {'c':3, 'a':2}, {'b':4}, {'a':4}]

duplicates_removed = []
seen = set()

for n in nums:
    n_s = frozenset(n.items())
    if n_s not in seen:
        duplicates_removed.append(n)
        seen.add(n_s)

duplicates_removed
# [{'a': 1}, {'a': 2, 'c': 3}, {'a': 3}, {'b': 4}, {'a': 4}]
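
One caveat: this only works when the dict's values are themselves hashable; a nested mutable value breaks it:

frozenset({'a': [1]}.items())  # TypeError: unhashable type: 'list'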
– Mark

We can compare each element with the previous one. If they are not the same, we add it to the new list containing unique values.

def remove_dups(nums):
    duplicates_removed = []
    duplicates_removed.append(nums[0])
    for i in range(1, len(nums)):
        if nums[i] != nums[i - 1]:
            duplicates_removed.append(nums[i])
    return duplicates_removed

lst1 = [1, 1, 2, 3, 4, 4]
remove_dups(lst1)
# Output: [1, 2, 3, 4]

lst2 = [{'a':1}, {'a':1}, {'c':3}, {'d':4, 'a':1}, {'d':4, 'a': 1}]
remove_dups(lst2)
# Output: [{'a': 1}, {'c': 3}, {'d': 4, 'a': 1}]
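
Note that this only removes consecutive duplicates, so it assumes equal elements are adjacent (e.g. a sorted list); non-adjacent repeats survive:

remove_dups([1, 2, 1])
# Output: [1, 2, 1]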
– ajay.16nk

You don't need a dict.

nums = [1, 1, 2, 3, 4, 4]
duplicates_removed = []
for n in nums:
    if n not in duplicates_removed:
        duplicates_removed.append(n)

The same works for custom classes, since the in operator compares elements via __eq__ (note that each membership test scans the whole list, so this is O(n^2)):

class X:
    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        return other.n == self.n

    def __ne__(self, other):
        # not strictly needed in Python 3, where __ne__ defaults to the inverse of __eq__
        return other.n != self.n

    def __repr__(self):
        return f"X({self.n})"


nums = [X(1), X(1), X(2), X(4), X(4), X(5)]
nodups = []

for n in nums:
    if n not in nodups:
        nodups.append(n)

print(nodups)  # [X(1), X(2), X(4), X(5)]
– Mad Wombat