
When I have a list of immutable objects, lst, and want to get rid of duplicates, I can just use set(lst):

lst = [0, 4, 2, 6, 3, 6, 4, 9, 2, 2]  # integers are immutable (and hashable) in Python
print(set(lst))  # {0, 2, 3, 4, 6, 9}

However, suppose I have a list of mutable objects, lst, and want to get rid of duplicates. set(lst) won't work because mutable objects are not hashable; we'd get a TypeError: unhashable type: '<type>'. What should we do in this case?

For example, suppose we have lst, a list of dicts (dicts are mutable and thus not hashable) and some dicts occur multiple times in lst:

d0 = {0:'a', 1:'b', 9:'j'}
d1 = {'jan':1, 'jul':7, 'dec':12}
d2 = {'hello':'hola', 'goodbye':'adios', 'happy':'feliz', 'sad':'triste'}
lst = [d0, d1, d1, d0, d2, d1, d0]

We want to iterate through lst, but only consider each dict once. If we do set(lst), we'd get a TypeError: unhashable type: 'dict'. Instead we have to do something like:

def dedup(lst):
  seen_ids = set()
  for elem in lst:
    id_ = id(elem)
    if id_ not in seen_ids:
      seen_ids.add(id_)
      yield elem
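
Running the generator on the example list yields each distinct object exactly once. Here is a self-contained check (repeating the definitions from above):

```python
def dedup(lst):
    # Yield each object at most once, keyed by identity (id).
    seen_ids = set()
    for elem in lst:
        if id(elem) not in seen_ids:
            seen_ids.add(id(elem))
            yield elem

d0 = {0: 'a', 1: 'b', 9: 'j'}
d1 = {'jan': 1, 'jul': 7, 'dec': 12}
d2 = {'hello': 'hola', 'goodbye': 'adios', 'happy': 'feliz', 'sad': 'triste'}
lst = [d0, d1, d1, d0, d2, d1, d0]

print(list(dedup(lst)) == [d0, d1, d2])  # True: one entry per object
```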

Is there a better way to do this???

joseville
  • It depends. There could also be two dictionaries which are identical, but are not the same instance, so they have different ids. What would you want to do with them? – zvone Nov 06 '22 at 12:38
  • BTW, why do you even have a list of dictionaries, some of which are the same dictionary? I can't imagine a scenario in which that is a good idea. – zvone Nov 06 '22 at 12:39
  • Do the answers to this [question](https://stackoverflow.com/questions/41704229/remove-duplicates-from-list-of-dicts) help at all? – quamrana Nov 06 '22 at 13:12
  • What should happen if d1 and d2 are equal (but still have different ids)? Depending on this it maybe is a duplicate question of https://stackoverflow.com/questions/11092511/list-of-unique-dictionaries or https://stackoverflow.com/questions/54964209/take-unique-values-out-of-a-list-with-unhashable-elements – mkrieger1 Nov 06 '22 at 13:42
  • @zvone It came up in a failed solution to a one-pass, no dfs/bfs solution to LC695: Max Area of Island. [This](https://replit.com/@joseville/LC695-Max-Area-of-Island#main.py) is the failed solution. Label each island with a number. If at any point, it turns out that two islands are actually the same then we "merge" them. To do this, keep `groups`, a list of sets, where `groups[i]` is the set of islands that island i is merged with. – joseville Nov 06 '22 at 14:07
  • For example, `groups[2] == s0` where `s0 == {2,5,6}` means islands `2`, `5`, and `6` are merged. And ideally we'd also have `groups[5] == s0` and `groups[6] == s0`, so that if we find we have to merge any of `2`, `5`, or `6` with a fourth island, say `8`, then we'd update `s0 == {2,5,6,8}` and `groups[8] = s0`. There are better ways to solve this problem. I was just intrigued by it being a one-pass solution that doesn't use bfs nor dfs. – joseville Nov 06 '22 at 14:09

3 Answers


Working with object ids is dangerous: it breaks down when two distinct dicts have the same content, since they have different ids. Instead, you can use the json module to serialize each dict to a string.

import json

def dedup(lst):
    myset = {json.dumps(x) for x in lst}  # serialize to a set
    return [json.loads(x) for x in myset]  # deserialize to the list
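
Note that `json.dumps` is sensitive to key insertion order and coerces non-string keys to strings, so two equal dicts can serialize differently, and the round trip may change key types. Passing `sort_keys=True` (a tweak to the answer above) fixes the ordering part:

```python
import json

def dedup(lst):
    # sort_keys=True so equal dicts with different insertion order
    # serialize to the same string
    myset = {json.dumps(x, sort_keys=True) for x in lst}
    return [json.loads(x) for x in myset]

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}       # equal to a, different insertion order
print(len(dedup([a, b])))  # 1

# Non-string keys do not round-trip: JSON object keys are strings.
print(json.loads(json.dumps({0: 'a'})))  # {'0': 'a'}
```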
0x0fba
  • It's not clear from the question yet if one or the other behavior is wanted. – mkrieger1 Nov 06 '22 at 13:45
  • According to the question, the matter is to find a solution for non-hashable objects (dicts in the example). If they were hashable, two objects with the same content would have the same hash and therefore would be equal. In my understanding the preferred behavior is that the solution must consider two objects equal if their contents are equal. – 0x0fba Nov 06 '22 at 13:58
  • `d0 = {0:'a', 1:'b', 9:'j'}`, `d3 = {0:'a', 1:'b', 9:'j'}`: `id(d0) == id(d3)` is `False`, but `d0 == d3` is `True`. – 0x0fba Nov 06 '22 at 14:03

Well, you could use a dict with id as the key. It's roughly the same idea, but it allows a one-liner like this.

list({id(x):x for x in lst}.values())
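
For the example in the question, this keeps the first occurrence of each object (dict comprehensions preserve first-insertion key order in Python 3.7+):

```python
d0 = {0: 'a', 1: 'b', 9: 'j'}
d1 = {'jan': 1, 'jul': 7, 'dec': 12}
d2 = {'hello': 'hola', 'goodbye': 'adios', 'happy': 'feliz', 'sad': 'triste'}
lst = [d0, d1, d1, d0, d2, d1, d0]

# Later duplicates map to the same id key, so each object survives once,
# in order of first appearance.
result = list({id(x): x for x in lst}.values())
print(result == [d0, d1, d2])  # True
```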
chrslg
  • Doesn't work anymore if you add to `lst` a `d3` dict with the same content as `d0`. – 0x0fba Nov 06 '22 at 13:32
  • @fbattello Indeed. But that seemed to me to be a postulate of the question. I see, in another comment, that you understood the question as requiring a deep comparison. I did not, though I admit it is not clear. And I am still not sure after the last comments (they seem to indicate that, indeed, we want to remove duplicates based on their content; but I'm not even sure if those are islands, or lists of indices, or...). But indeed, if we want to remove duplicates by content, your answer is the one (though it wouldn't work if there are cycles or things like that). If not, then it is too slow. – chrslg Nov 06 '22 at 14:21

This method uses two lists: a list of ids and a list of elements.

d0 = {0:'a', 1:'b', 9:'j'}
d1 = {'jan':1, 'jul':7, 'dec':12}
d2 = {'hello':'hola', 'goodbye':'adios', 'happy':'feliz', 'sad':'triste'}
lst = [d0, d1, d1, d0, d2, d1, d0]

def dedup(lst):
    id_seen = []
    elems = []
    for elem in lst:
        if id(elem) not in id_seen:
            id_seen.append(id(elem))
            elems.append(elem)
    return elems

print(dedup(lst))
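
If, as the comments suggest, duplicates should be detected by content rather than identity, the same pattern works with an equality check instead of `id`. A sketch (the name `dedup_by_value` is illustrative; note `elem not in elems` compares with `==`, so this is O(n²)):

```python
def dedup_by_value(lst):
    # Content-based variant: equal dicts collapse even when they are
    # distinct objects. `in` uses ==, so this is quadratic overall.
    elems = []
    for elem in lst:
        if elem not in elems:
            elems.append(elem)
    return elems

d0 = {0: 'a'}
d3 = {0: 'a'}  # equal to d0 but a different object
print(len(dedup_by_value([d0, d3])))  # 1
```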
user3435121