Dictionary of dictionaries - Removing duplications in only certain keys/values

Question

I have a dictionary of dictionaries; a sample is below:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

I want to remove duplicates in my_dictionary when the values of both "Name" AND "Age" are the same. It does not matter which one is removed (there could be many duplicates, but I only want one version to remain).

So in our example above, the output would be:

{'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
 '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
 '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}

Nick, 39 counts as a duplicate even though the country differs.

Is there an easy/efficient way of doing this? I have several million rows.

Do you need to keep the index keys of the outer dictionary, or could the output also be a list of unique dicts? Have you already had a look at [this question](https://stackoverflow.com/questions/9427163/remove-duplicate-dict-in-list-in-python)? Alternatively, I could imagine using a functional approach. However, I do not know how each approach performs. — albert, Sep 07 '22 at 09:16
The index key does not matter, although I do need something there if that makes sense? i.e. it doesn't need to be re-indexed, it could go (1, 2, 8, 9, 11) etc. — Nicholas, Sep 07 '22 at 09:17
You can make a first pass over all values to detect the first occurrence of each (name, age) pair. During this pass you fill two sets: one with the pairs already encountered and one with the keys of their first occurrences. From the set of keys you can then build your desired dictionary. — cglacet, Sep 07 '22 at 09:20

score 2 · Accepted Answer · edited Sep 07 '22 at 09:27

Track seen records, for example:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

seen = set()  # (Name, Age) pairs already encountered
result = {}
for k, v in my_dictionary.items():
    key = (v['Name'], v['Age'])
    if key not in seen:  # keep only the first record for each pair
        result[k] = v
        seen.add(key)

print(result)

Output:

{
    '0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'}, 
    '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, 
    '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}
}

Edit note: Tracking the seen pairs in a set() (which is backed by a hash table) makes each membership check O(1) on average, so the overall complexity is O(n) for n rows.
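If the fields that define a duplicate may change, the same loop can be wrapped in a small helper. This is only a sketch: the function name dedupe_by and its fields parameter are made up for illustration and are not part of the answer above.

```python
from operator import itemgetter


def dedupe_by(records, fields):
    """Keep the first record for each distinct combination of `fields`.

    `dedupe_by` and `fields` are hypothetical names used for this sketch.
    """
    key_of = itemgetter(*fields)  # yields a tuple when several fields are given
    seen = set()
    result = {}
    for k, v in records.items():
        key = key_of(v)
        if key not in seen:  # first time this combination appears
            seen.add(key)
            result[k] = v
    return result


my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

result = dedupe_by(my_dictionary, ("Name", "Age"))
print(result)
```

It still makes a single pass over the rows, so the O(n) behaviour described in the edit note should carry over.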

This is great, thank you funnydman! That's such a good way of doing it, ty :) — Nicholas, Sep 07 '22 at 09:23

score 1 · Answer 2 · Mechanic Pig · 2022-09-07T09:26:09.440

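Use two dictionary comprehensions. This is easier to write, but it will be slower than using a set. Note that it keeps the last occurrence of each (Name, Age) pair rather than the first; in the second line, _ is the interactive interpreter's name for the previous result.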

>>> {(v['Name'], v['Age']): k for k, v in my_dictionary.items()}
{('Nick', 39): '4', ('Steve', 19): '1', ('Dave', 23): '2'}
>>> {k: my_dictionary[k] for k in _.values()}
{'4': {'Name': 'Nick', 'Age': 39, 'Country': 'France'},
 '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
 '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}

edited Sep 07 '22 at 09:26

answered Sep 07 '22 at 09:24


Great! Thank you Mechanic Pig! :) – Nicholas Sep 07 '22 at 09:25

If you reverse the first iteration order it will get the desired output. – cglacet Sep 07 '22 at 09:27
@cglacet OP doesn't care which one to delete, but this is still a good suggestion :) – Mechanic Pig Sep 07 '22 at 09:29
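For reference, cglacet's suggestion could look roughly like the sketch below. It assumes Python 3.8 or later, where reversed() works on dictionary views; the variable name first_key is made up for illustration.

```python
my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

# Iterate in reverse so the earliest key for each (Name, Age) pair is
# written last and therefore wins. Requires Python 3.8+ (assumption).
first_key = {
    (v["Name"], v["Age"]): k for k, v in reversed(my_dictionary.items())
}

# Rebuild the outer dictionary from the surviving keys.
result = {k: my_dictionary[k] for k in first_key.values()}
print(result)
```

This keeps the records under keys '0', '1' and '2', as in the question's expected output, although the keys come out in the order '0', '2', '1' because the first comprehension walked the dictionary backwards.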
