Remove duplicate dict in list in Python

Question

I have a list of dicts, and I'd like to remove the dicts with identical key and value pairs.

For this list: [{'a': 123}, {'b': 123}, {'a': 123}]

I'd like to return this: [{'a': 123}, {'b': 123}]

Another example:

For this list: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

I'd like to return this: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Can you tell us more about the actual problem you're trying to solve? This seems like an odd problem to have. — gfortune, Feb 24 '12 at 07:50
I am combining a few lists of dicts and there are duplicates. So I need to remove those duplicates. — Brenden, Feb 24 '12 at 07:51
I found a solution in http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order in an answer without the usage of ```set()``` — Sebastian Wagner, Jun 13 '16 at 10:37
@gfortune I encountered this problem in real life with a large ETL script that queues data for upload as a list of dicts. Sometimes multiple records from Scope A will bring in the same records from Scope B, but no need to upload redundant output to the external system. — bendodge, Jan 19 '22 at 22:49
https://stackoverflow.com/a/23358757/10413550 use this answer if you are looking for fastest way — SKJ, Dec 14 '22 at 08:59

score 409 · Accepted Answer · edited Jul 17 '18 at 15:26

409

Try this:

[dict(t) for t in {tuple(d.items()) for d in l}]

The strategy is to convert each dictionary in the list to a tuple containing the items of that dictionary. Since tuples can be hashed (as long as all the values are hashable), you can remove duplicates using set (using a set comprehension here, older python alternative would be set(tuple(d.items()) for d in l)) and, after that, re-create the dictionaries from the tuples with dict.

where:

l is the original list
d is one of the dictionaries in the list
t is one of the tuples created from a dictionary

Edit: If you want to preserve ordering, the one-liner above won't work, since a set doesn't keep insertion order. However, a few more lines of code will do it:

l = [{'a': 123, 'b': 1234},
     {'a': 3222, 'b': 1234},
     {'a': 123, 'b': 1234}]

seen = set()
new_l = []
for d in l:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

print(new_l)

Example output:

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Note: As pointed out by @alexis, two dictionaries with the same keys and values might not produce the same tuple. That can happen if their keys were inserted in a different order, or if they went through a different history of adding and removing keys. If that's the case for your problem, consider using tuple(sorted(d.items())) as he suggests.
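To make the note above concrete, here is a minimal sketch of the order-preserving loop with the sorted-items fix applied. The function name dedupe_order_insensitive is made up for illustration, and the sketch assumes that the keys can be sorted against each other and that all values are hashable.

```python
def dedupe_order_insensitive(dicts):
    """Return the unique dicts, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for d in dicts:
        # Sorting the items makes the key independent of insertion order.
        # Assumption: keys are mutually sortable and values are hashable.
        key = tuple(sorted(d.items()))
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


example = [{'a': 1, 'i': 2}, {'i': 2, 'a': 1}, {'a': 3}]
result = dedupe_order_insensitive(example)
print(result)
```

The second dict is equal to the first even though its keys were inserted in a different order, so only the first one is kept.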

edited Jul 17 '18 at 15:26

Jean-François Fabre


answered Feb 24 '12 at 07:51

jcollado


59

Nice solution but it has a bug: `d.items()` is not guaranteed to return elements in a particular order. You should do `tuple(sorted(d.items()))` to ensure you don't get different tuples for the same key-value pairs. – alexis Feb 24 '12 at 14:58
2

@alexis I made a few tests and you are indeed right. If a lot of keys are added in between and removed later, then that could be the case. Thanks a lot for your comment. – jcollado Feb 24 '12 at 15:53
Cool. I added the fix to your answer for the benefit of future readers who might not read the whole conversation. – alexis Feb 24 '12 at 21:46
Doesn't need *"a lot of keys"* to fail, already fails `[{'a': 1, 'i': 2}, {'i': 2, 'a': 1}]`. – Stefan Pochmann May 31 '16 at 14:24
3

Note, this will not work if you load that list of dicts from the `json` module as I did – Dhruv Ghulati Jul 25 '16 at 08:19
I can't seem to make this work for me and I imported my data from the `json` module – nodox Jul 26 '16 at 15:32
This will throw an error if you have an `OrderedDict` as a value – Dean Christian Armada Aug 10 '17 at 10:35
9

This is a valid solution in this case, but won't work in case of nested dictionaries – Lorenzo Belli Jan 26 '18 at 13:06
5

It says "TypeError: unhashable type: 'list'" for the step "if t not in seen:" – Vreddhi Bhat Mar 15 '19 at 08:04
@jcollado Is there a way to preserve dict with most keys. For example "a" key is the same for two dicts, but one of them have more keys ? – Stefan Jul 23 '20 at 22:01

score 91 · Answer 2 · edited Feb 24 '12 at 09:10

91

Another one-liner based on list comprehensions:

>>> d = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> [i for n, i in enumerate(d) if i not in d[n + 1:]]
[{'b': 123}, {'a': 123}]

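Since dicts can be compared for equality, we keep only the elements that do not appear again in the rest of the initial list. The "rest of the list" is only reachable through the index n, hence the use of enumerate. Note that this keeps the last occurrence of each duplicate, as the output above shows.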
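As a small illustration of the slicing idea, the sketch below contrasts the original comprehension with a variant that looks backwards instead of forwards. The variable names keep_last and keep_first are made up for this example; both versions rely only on == comparison, so they also accept unhashable values such as lists, at the cost of quadratic time.

```python
d = [{'a': 123}, {'b': 123}, {'a': 123}]

# Original form: drop an element if an equal one appears LATER in the list,
# so the last occurrence of each duplicate survives.
keep_last = [i for n, i in enumerate(d) if i not in d[n + 1:]]

# Variant: drop an element if an equal one appeared EARLIER in the list,
# so the first occurrence survives and the original order is kept.
keep_first = [i for n, i in enumerate(d) if i not in d[:n]]

# Only equality is used, so unhashable values such as lists are fine.
with_lists = [{'a': [1, 2]}, {'a': [1, 2]}, {'a': [3]}]
unique_with_lists = [i for n, i in enumerate(with_lists) if i not in with_lists[:n]]
```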

edited Feb 24 '12 at 09:10

answered Feb 24 '12 at 09:05

Emmanuel


7

This also works for a list of dictionaries whose values are lists, unlike the first answer – gbozee Dec 02 '15 at 08:09
7

this also works when you may have an unhashable type as a value in your dictionaries, unlike the top answer. – Steve Rossiter Feb 01 '16 at 12:43
1

here, purpose is to remove duplicate values, not key, see this answer's code – Jamil Noyda Oct 04 '18 at 09:40
3

This is very inefficient code. `if i not in d[n + 1:]` iterates over the entire list of dicts (from `n`, but that just halves the total number of operations) and you're doing that check for every element in your list, so this code is O(n^2) time complexity – Boris Verkhovskiy May 14 '20 at 18:37
doesn't work for dictionaries with dictionaries as values – Roko Mijic Jun 04 '20 at 11:11
Works for unhashable types: lists as values in a dictionary from the top list of dictionaries. – IAmBotmaker Oct 02 '22 at 19:28

score 61 · Answer 3 · answered Jul 17 '18 at 19:43

If using a third-party package would be okay then you could use iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]

It preserves the order of the original list and it can also handle unhashable items like dictionaries by falling back on a slower algorithm (O(n*m), where n is the number of elements in the original list and m the number of unique elements, instead of O(n)). If both keys and values are hashable, you can use the key argument of that function to create hashable items for the "uniqueness test" (so that it works in O(n)).

In the case of a dictionary (which compares independent of order) you need to map it to another data-structure that compares like that, for example frozenset:

>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]
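For readers without the package, the behaviour described above can be approximated in pure Python. This is only a rough sketch in the spirit of the itertools recipe, not the library's actual implementation (which is a C extension); the name unique_everseen_sketch and the two-container fallback are assumptions made for illustration.

```python
def unique_everseen_sketch(iterable, key=None):
    """Yield items in order, skipping any whose key was already seen."""
    seen_hashable = set()    # fast path, O(1) membership tests
    seen_unhashable = []     # slow path, O(m) membership tests
    for item in iterable:
        k = item if key is None else key(item)
        try:
            is_new = k not in seen_hashable
            if is_new:
                seen_hashable.add(k)
        except TypeError:
            # k is unhashable (for example a dict), so fall back to a list.
            is_new = k not in seen_unhashable
            if is_new:
                seen_unhashable.append(k)
        if is_new:
            yield item


l = [{'a': 123}, {'b': 123}, {'a': 123}]
slow = list(unique_everseen_sketch(l))
fast = list(unique_everseen_sketch(l, key=lambda item: frozenset(item.items())))
```

Without a key, every dict goes through the list fallback; with the frozenset key, all lookups stay on the set.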

Note that you shouldn't use a simple tuple approach (without sorting) because equal dictionaries don't necessarily have the same order (even in Python 3.7 where insertion order - not absolute order - is guaranteed):

>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False

And even sorting the tuple might not work if the keys aren't sortable:

>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'
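For comparison, a short sketch showing that a frozenset key sidesteps both problems above: it never compares keys with <, and it ignores insertion order. The helper name order_insensitive_key and the dict d4 are introduced here only for illustration; the values still have to be hashable.

```python
def order_insensitive_key(d):
    # A frozenset compares by membership only, so neither the insertion
    # order nor the sortability of the keys matters.
    # Assumption: every value in d is hashable.
    return frozenset(d.items())


d3 = {1: 1, 'a': 'a'}
d4 = {'a': 'a', 1: 1}   # same content as d3, different insertion order

same_key = order_insensitive_key(d3) == order_insensitive_key(d4)
same_hash = hash(order_insensitive_key(d3)) == hash(order_insensitive_key(d4))
```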

Benchmark

I thought it might be useful to see how the performance of these approaches compares, so I did a small benchmark. The benchmark graphs are time vs. list-size based on a list containing no duplicates (that was chosen arbitrarily, the runtime doesn't change significantly if I add some or lots of duplicates). It's a log-log plot so the complete range is covered.

The absolute times: [plot not included]

The timings relative to the fastest approach: [plot not included]

The second approach from thefourtheye is the fastest here. The unique_everseen approach with the key function is in second place; however, it's the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and scale much worse: O(n*n) instead of O(n). stpk's approach with json isn't O(n*n), but it's much slower than the similar O(n) approaches.

The code to reproduce the benchmarks:

from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen

def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]

def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l

def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]

def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])
    return b

def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]

def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()

def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()

def iu_1(l):
    return list(unique_everseen(l))

def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')

%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'

b.plot(relative_to=thefourtheye_2)

For completeness here is the timing for a list containing only duplicates:

# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

The timings don't change significantly except for unique_everseen without a key function, which in this case is the fastest solution. However, that's just the best case (so not representative) for that function with unhashable values, because its runtime depends on the number of unique values in the list: O(n*m), where m is just 1 in this case, so it runs in O(n).

Disclaimer: I'm the author of iteration_utilities.

Can you share the Python version of the `unique_everseen` source code? GitHub only has .c versions. — Sirjon, Jun 12 '23 at 09:53
@Sirjon The code is compiled to a Python C extension. There's not really a Python version. However the code is based on the recipe in https://docs.python.org/3/library/itertools.html#itertools-recipes (just a bit more optimized) — MSeifert, Jun 12 '23 at 16:41

score 33 · Answer 4 · answered Aug 02 '16 at 13:52

33

Other answers would not work if you're operating on nested dictionaries such as deserialized JSON objects. For this case you could use:

import json
set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
X = [json.loads(t) for t in set_of_jsons]
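The set-based snippet above loses the original order and rebuilds every dict from JSON. A possible variation, sketched here under the assumption that every dict is JSON-serializable, uses the JSON string only as a lookup key and returns the original objects in first-seen order. The function name dedupe_nested is made up for this example.

```python
import json


def dedupe_nested(dicts):
    """Return unique (possibly nested) dicts, keeping first occurrences."""
    seen = set()
    unique = []
    for d in dicts:
        # sort_keys=True also sorts the keys of nested dicts, so key order
        # at any depth does not affect the result.
        key = json.dumps(d, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


X = [{'a': {'b': 1, 'c': 2}}, {'a': {'c': 2, 'b': 1}}, {'a': {'b': 3}}]
unique_X = dedupe_nested(X)
```

Because the originals are returned rather than round-tripped through json.loads, value types that JSON would change (tuples becoming lists, for instance) are left untouched.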

answered Aug 02 '16 at 13:52

stpk


3

Great! the trick is that dict object cannot be directly added to a set, it needs to be converted to json object by dump(). – Reihan_amn May 08 '19 at 01:00

score 24 · Answer 5 · answered Aug 01 '18 at 13:34

24

If you are using Pandas in your workflow, one option is to feed a list of dictionaries directly to the pd.DataFrame constructor. Then use drop_duplicates and to_dict methods for the required result.

import pandas as pd

d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')

print(d_unique)

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

answered Aug 01 '18 at 13:34

jpp


For future googlers. You might also want to add `astype(str)` after `pd.DataFrame()` -> `pd.DataFrame().astype(str)`. Otherwise you might receive a `TypeError: unhashable type: 'dict'` error. – Dmitriy Zub Jun 15 '22 at 10:26
That's a great solution, If you add details like how this method would perform compared with others mentioned answers in terms of performance. That would be great too. – Faisal Nazik Aug 17 '22 at 13:00

score 22 · Answer 6 · answered Apr 29 '14 at 07:52

22

If you want to preserve the order, then you can do

from collections import OrderedDict
print(list(OrderedDict((frozenset(item.items()), item) for item in data).values()))
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

If the order doesn't matter, then you can do

print(list({frozenset(item.items()): item for item in data}.values()))
# e.g. [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}] (the order is arbitrary before Python 3.7)
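On Python 3.7 and later, plain dicts keep insertion order, so the dict-comprehension form also preserves the order of first appearance. The sketch below assumes Python 3.7+ and wraps the result in list(), because .values() returns a view object in Python 3; the data list is the example from the question.

```python
data = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

# Each frozenset key keeps the position of its first insertion; a later
# duplicate only overwrites the stored value with an equal dict.
unique = list({frozenset(item.items()): item for item in data}.values())
print(unique)
```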

answered Apr 29 '14 at 07:52

thefourtheye


1

Note: in python 3, your second approach gives a non-serializable `dict_values` output instead of a list. You have to cast the whole thing in a list again. `list(frozen.....)` – saran3h Jul 25 '19 at 06:04
why this is not marked as correct answer? this method is faster and reduces time from 40 min to 1 min for me. thx – SKJ Dec 14 '22 at 09:03
1

**//why this is not marked as correct answer//** because the person who asked the question is the one who gets to decide ... if this is correct for your use, you can still up-vote this answer, even though you were not the one who originally asked the question – dreftymac Feb 09 '23 at 16:24

score 22 · Answer 7 · edited Jun 15 '22 at 11:13

22

Sometimes old-style loops are still useful. This code is a little longer than jcollado's, but very easy to read:

a = [{'a': 123}, {'b': 123}, {'a': 123}]
b = []
for i in range(len(a)):
    if a[i] not in a[i+1:]:
        b.append(a[i])

edited Jun 15 '22 at 11:13

Luatic


answered Feb 24 '12 at 08:10

Scorpil


3

The `0`in `range(0, len(a))` is not necessary. – Juan Antonio Feb 08 '18 at 18:47

score 8 · Answer 8 · edited Jun 14 '18 at 09:22

8

Not a universal answer, but if your list happens to be sorted by some key, like this:

l = [{'a': {'b': 31}, 't': 1},
     {'a': {'b': 31}, 't': 1},
     {'a': {'b': 145}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 112}, 't': 3}]

then the solution is as simple as:

import itertools
result = [a[0] for a in itertools.groupby(l)]

Result:

[{'a': {'b': 31}, 't': 1},
 {'a': {'b': 145}, 't': 2},
 {'a': {'b': 25231}, 't': 2},
 {'a': {'b': 112}, 't': 3}]

Works with nested dictionaries and (obviously) preserves order.
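itertools.groupby only merges adjacent equal items, which is why the list has to be sorted first. If it is not, one option is to sort by a canonical string before grouping. The sketch below is an assumption-laden variation rather than part of the approach above: canonical and dedupe_unsorted are made-up names, the dicts must be JSON-serializable, and the result comes back in sorted order instead of the original order.

```python
import itertools
import json


def canonical(d):
    # A deterministic string for a (possibly nested) dict.
    return json.dumps(d, sort_keys=True)


def dedupe_unsorted(dicts):
    # Sorting brings equal dicts next to each other so groupby can merge them.
    ordered = sorted(dicts, key=canonical)
    return [next(group) for _, group in itertools.groupby(ordered, key=canonical)]


unsorted = [{'a': {'b': 31}, 't': 1}, {'a': {'b': 145}, 't': 2}, {'a': {'b': 31}, 't': 1}]

# Plain groupby leaves the non-adjacent duplicate in place.
adjacent_only = [key for key, _ in itertools.groupby(unsorted)]
deduped = dedupe_unsorted(unsorted)
```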

edited Jun 14 '18 at 09:22

answered Jun 14 '18 at 07:49

Highstaker


This works even with a dictionary with a list in it. – Eric Oct 06 '20 at 08:14

score 2 · Answer 9 · answered Feb 21 '21 at 02:21

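The easiest way is to convert each item in the list to a string, since a dictionary is not hashable. Then you can use a set to remove the duplicates. Note that str() reflects insertion order, so two equal dicts whose keys were inserted in a different order produce different strings.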

list_org = [{'a': 123}, {'b': 123}, {'a': 123}]
list_org_updated = [str(item) for item in list_org]
print(list_org_updated)
# ["{'a': 123}", "{'b': 123}", "{'a': 123}"]
unique_set = set(list_org_updated)
print(unique_set)
# {"{'b': 123}", "{'a': 123}"}

You can use the set, but if you do want a list, then add the following:

import ast
unique_list = [ast.literal_eval(item) for item in unique_set]
print(unique_list)
# [{'b': 123}, {'a': 123}]

score 2 · Answer 10 · answered Oct 16 '21 at 07:44

2

Remove duplications by custom key:

def remove_duplications(arr, key):
    return list({key(x): x for x in arr}.values())
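A short usage sketch for the helper above, repeated here so the example runs on its own. The records list and the two key functions are made up for illustration, and the stated ordering assumes Python 3.7+, where dicts keep insertion order.

```python
def remove_duplications(arr, key):
    return list({key(x): x for x in arr}.values())


records = [{'id': 1, 'v': 'x'}, {'id': 2, 'v': 'y'}, {'id': 1, 'v': 'z'}]

# Deduplicate on a single field: the LAST record with a given id wins,
# but it takes the position where that id first appeared.
by_id = remove_duplications(records, key=lambda r: r['id'])

# Deduplicate on the whole dict (all values must be hashable).
by_content = remove_duplications(records, key=lambda r: frozenset(r.items()))
```

Here by_id keeps the 'z' record for id 1, while by_content keeps all three records because no two are fully equal.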

answered Oct 16 '21 at 07:44

BaiJiFeiLong


score 2 · Answer 11 · edited Jul 14 '23 at 08:45

Input

input_list = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

Output Required

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Code

input_list = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

unique_list = []

# Keep an item only if an equal dict has not been kept already.
for item in input_list:
    if item not in unique_list:
        unique_list.append(item)

print("previous list =", input_list)
print("Updated list =", unique_list)

Output

previous list = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
Updated list = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

score 2 · Answer 12 · answered Feb 24 '12 at 08:03

2

You can use a set, but you need to turn the dicts into a hashable type.

seq = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
unique = set()
for d in seq:
    t = tuple(d.items())  # d.iteritems() in Python 2
    unique.add(t)

unique now equals

{(('a', 3222), ('b', 1234)), (('a', 123), ('b', 1234))}

To get dicts back:

[dict(x) for x in unique]

answered Feb 24 '12 at 08:03

Matimus


1

The order of `d.iteritems()` isn't guaranteed - so you may end up with 'duplicates' in `unique`. – danodonovan Oct 02 '19 at 12:22

score 1 · Answer 13 · edited Mar 03 '20 at 09:34

1

Not so short but easy to read:

list_of_data = [{'a': 123}, {'b': 123}, {'a': 123}]

list_of_data_uniq = []
for data in list_of_data:
    if data not in list_of_data_uniq:
        list_of_data_uniq.append(data)

Now list_of_data_uniq contains only the unique dicts.

edited Mar 03 '20 at 09:34

Georgy


answered Nov 17 '19 at 09:59

user1723157


score 1 · Answer 14 · answered Feb 14 '20 at 06:37

Here's a quick one-line solution with a doubly-nested list comprehension (based on @Emmanuel 's solution).

This uses a single key (for example, a) in each dict as the primary key, rather than checking if the entire dict matches

[i for n, i in enumerate(list_of_dicts) if i.get(primary_key) not in [y.get(primary_key) for y in list_of_dicts[n + 1:]]]

It's not what OP asked for, but it's what brought me to this thread, so I figured I'd post the solution I ended up with
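The one-liner above rescans the tail of the list for every element. If the primary-key values are hashable, the same result can be computed in a single pass over an index table. This is a sketch under that assumption; the name dedupe_by_primary_key and the sample data are invented for the example.

```python
def dedupe_by_primary_key(list_of_dicts, primary_key):
    # Map each primary-key value to the index of its LAST occurrence.
    # Assumption: the values stored under primary_key are hashable.
    last_index = {d.get(primary_key): n for n, d in enumerate(list_of_dicts)}
    # Keep a dict only if it is the last one carrying its key value,
    # which matches what the nested comprehension keeps.
    return [d for n, d in enumerate(list_of_dicts)
            if last_index[d.get(primary_key)] == n]


list_of_dicts = [{'a': 1, 'x': 1}, {'a': 2}, {'a': 1, 'x': 2}]
primary_key = 'a'

fast = dedupe_by_primary_key(list_of_dicts, primary_key)
slow = [i for n, i in enumerate(list_of_dicts)
        if i.get(primary_key) not in [y.get(primary_key) for y in list_of_dicts[n + 1:]]]
```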

score 0 · Answer 15 · answered Feb 17 '21 at 16:42

There are a lot of good examples here that search for duplicate values and keys. Below is the way we filter out whole-dictionary duplicates in lists. Use dupKeys = [] if your source data consists of identically formatted dictionaries and you are looking for exact duplicates. Otherwise set dupKeys to the names of the keys whose values must not repeat; it can hold 1 to n keys. It isn't elegant, but it works and is very flexible.

import binascii

collected_sensor_data = [{"sensor_id":"nw-180","data":"XXXXXXX"},
                         {"sensor_id":"nw-163","data":"ZYZYZYY"},
                         {"sensor_id":"nw-180","data":"XXXXXXX"},
                         {"sensor_id":"nw-97", "data":"QQQQQZZ"}]

dupKeys = ["sensor_id", "data"]

def RemoveDuplicateDictData(collected_sensor_data, dupKeys):
    # CRC32 checksums of the entries kept so far. CRC32 is not collision-free,
    # so two different entries could in principle share a checksum.
    checkCRCs = []
    final_sensor_data = []

    if dupKeys == []:
        # Compare whole dicts via their string representation.
        for sensor_read in collected_sensor_data:
            ck1 = binascii.crc32(str(sensor_read).encode('utf8'))
            if ck1 not in checkCRCs:
                final_sensor_data.append(sensor_read)
                checkCRCs.append(ck1)
    else:
        # Compare only the values of the keys listed in dupKeys.
        for sensor_read in collected_sensor_data:
            tmp = ""
            for k in dupKeys:
                tmp += str(sensor_read[k])

            ck1 = binascii.crc32(tmp.encode('utf8'))
            if ck1 not in checkCRCs:
                final_sensor_data.append(sensor_read)
                checkCRCs.append(ck1)

    return final_sensor_data

final_sensor_data = RemoveDuplicateDictData(collected_sensor_data, dupKeys)
# final_sensor_data is now:
# [{"sensor_id":"nw-180","data":"XXXXXXX"},
#  {"sensor_id":"nw-163","data":"ZYZYZYY"},
#  {"sensor_id":"nw-97", "data":"QQQQQZZ"}]

score 0 · Answer 16 · answered Mar 13 '22 at 14:29

If you don't need it to scale or be especially fast, a simple function will do:

# Filters dicts with the same value in unique_key
# in: [{'k1': 1}, {'k1': 33}, {'k1': 1}]
# out: [{'k1': 1}, {'k1': 33}]
def remove_dup_dicts(list_of_dicts: list, unique_key) -> list:
    unique_values = list()
    unique_dicts = list()
    for obj in list_of_dicts:
        val = obj.get(unique_key)
        if val not in unique_values:
            unique_values.append(val)
            unique_dicts.append(obj)
    return unique_dicts
