
I was looking for a set()-like way to deduplicate a list, except that the items in the original list are not hashable (they are dicts).

I spent a while looking for something adequate, and I ended up writing this little function:

def deduplicate_list(lst, key):
    output = []
    keys = []
    for i in lst:
        if i[key] not in keys:
            output.append(i)
            keys.append(i[key])

    return output

Provided that the key is given correctly and is a string, this function does its job pretty well. Needless to say, if I learn about a built-in or a standard-library module that provides the same functionality, I'll happily drop my little routine in favor of a more standard and robust choice.

Are you aware of such an implementation?

-- Note

The following one-liner, found in this answer,

[dict(t) for t in set([tuple(d.items()) for d in l])]

while clever, won't work in my case: the items are nested dicts, so tuple(d.items()) still contains unhashable dict values, and the set() call raises a TypeError.
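
For reference, a minimal reproduction of the failure (the dict below is a trimmed-down, hypothetical version of my items):

d = {"id": "1234", "attributes": {"handle": "jsmith"}}

t = tuple(d.items())  # the inner dict survives as a value inside the tuple
set([t])              # raises TypeError: unhashable type: 'dict'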

-- Example

For clarity, here is an example of using such a routine:

with_duplicates = [
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    },
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    }
]

without_duplicates = deduplicate_list(with_duplicates, key='id')
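
Since the two entries above are identical, the call should leave a single element:

print(len(without_duplicates))  # -> 1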

4 Answers


You are picking only the first dict in your list for every distinct value of key. itertools.groupby is a built-in tool that can do this for you: sort by the key, group by it, and take only the first element of each group:

from itertools import groupby

def deduplicate(lst, key):
    fnc = lambda d: d.get(key)  # more robust than d[key]
    return [next(g) for k, g in groupby(sorted(lst, key=fnc), key=fnc)]
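
A quick check of this approach against the question's example (assuming the with_duplicates list from above is in scope):

without_duplicates = deduplicate(with_duplicates, 'id')
print(len(without_duplicates))  # -> 1, both entries share id '1234'

Note that the sorted() call means the result is ordered by key value, not by the original list order.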

This answer helps solve a more generic problem: finding unique elements not by a single attribute (id in your case), but by treating two elements as duplicates only when all of their nested attributes match.

The following code returns a list of indices of the unique elements:

import copy

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set, to any level of
    nesting, provided it contains only other hashable types (including
    any nested lists, tuples, sets and dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple(make_hash(e) for e in o)
    elif not isinstance(o, dict):
        return hash(o)

    # Replace every value by its hash, then hash the resulting items.
    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))

l = [
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    },
    {
        "type": "users",
        "attributes": {
            "first-name": "AAA",
            "email": "aaa.aaah@gmail.com",
            "last-name": "XXX",
            "handle": "jsmith"
        },
        "id": "1234"
    },
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    },
]

# get indices of the unique elements
In [254]: list({make_hash(x): i for i, x in enumerate(l)}.values())
Out[254]: [1, 2]
    Do not use hashes to determine if two objects are the same. You always have the risk of hash collisions. – Rob Jun 03 '16 at 12:57
  • @Rob, doesn't the `set` implementation use hashes (a C or Cython implementation of this function) for generating "virtual" keys? And I guess it's common practice to use `set(lst)` for deduplicating lists... – MaxU - stand with Ukraine Jun 03 '16 at 13:14
  • `set` uses hashes indeed, but only to make the implementation faster. If two objects have the same hash but different values, they will still be separate in a `set`, but not in this implementation. – Rob Jun 03 '16 at 13:58
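
Following up on the collision concern in the comments, here is a minimal sketch (an assumption of mine, not part of the original answer) that avoids the issue by converting each element into a canonical hashable structure and using that structure itself as the dict key; Python dicts fall back on full equality comparison when two keys hash alike, so colliding hashes can no longer conflate distinct elements:

def make_hashable(o):
    # Recursively convert dicts, lists and sets into hashable equivalents.
    if isinstance(o, dict):
        return frozenset((k, make_hashable(v)) for k, v in o.items())
    if isinstance(o, (list, tuple)):
        return tuple(make_hashable(e) for e in o)
    if isinstance(o, set):
        return frozenset(make_hashable(e) for e in o)
    return o

# Same idea as above, but keyed on the full structure instead of a hash.
unique_indices = list({make_hashable(x): i for i, x in enumerate(l)}.values())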

You could try a short version based on the answer linked in your question:

key = "id"
deduplicated = [val for ind, val in enumerate(l)
                if val[key] not in [tmp[key] for tmp in l[ind + 1:]]]
print(deduplicated)

Note that this keeps the last occurrence among duplicates, and the nested comprehension makes it quadratic in the length of the list.


In your example, the value stored under the key is hashable. If that is always the case, then use this:

def deduplicate(lst, key):
    return list({item[key]: item for item in lst}.values())

If there are duplicates, only the last matching duplicate is retained.
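
A tiny sanity check (with a hypothetical rows list) showing the last-wins behavior:

rows = [{"id": "1", "v": "old"}, {"id": "1", "v": "new"}]
print(deduplicate(rows, "id"))  # -> [{'id': '1', 'v': 'new'}]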
