
I have a dictionary of dictionaries; a sample is below:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

I want to remove duplicates from my_dictionary when the values of both "Name" AND "Age" are the same. It does not matter which entry is removed (there could be many that match), as long as exactly one remains.

So in our example above, the output would be:

{'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
 '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
 '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}

This is because Nick, 39 counts as a duplicate even though the country differs.

Is there an easy/efficient way of doing this? I have several million rows.

funnydman
Nicholas
  • Do you need to keep the index keys of the outer dictionary, or could the output also be a list of unique dicts? Have you already had a look at [this question](https://stackoverflow.com/questions/9427163/remove-duplicate-dict-in-list-in-python)? Alternatively, I could imagine using a functional approach. However, I do not know how each approach performs. – albert Sep 07 '22 at 09:16
  • The index key does not matter, although I do need something there if that makes sense? i.e. it doesn't need to be re-indexed, it could go (1, 2, 8, 9, 11) etc. – Nicholas Sep 07 '22 at 09:17
  • You can do a first pass over all values to detect the first occurrence of each (name, age) pair. During this pass you can fill both a set with the pairs already encountered and another set with the keys of the first occurrences of all pairs. From the set of keys you can simply build your desired dictionary. – cglacet Sep 07 '22 at 09:20
  • That's a great way of doing it, thank you cglacet! – Nicholas Sep 07 '22 at 09:24

2 Answers


Track seen records, for example:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

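seen = set()   # (Name, Age) pairs that have already been kept
result = {}
for k, v in my_dictionary.items():
    pair = (v['Name'], v['Age'])
    if pair not in seen:
        # First time this pair appears: keep the record and remember the pair.
        seen.add(pair)
        result[k] = v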

print(result)

Output:

{
    '0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'}, 
    '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, 
    '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}
}

Edit note: tracking seen pairs in a set (which is backed by a hash table) gives average O(1) membership tests, so the whole pass is O(n) for n rows.
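If the fields that define a duplicate might change, the same loop can be wrapped in a small helper. This is only a sketch: the name `dedupe_by_fields` and its `fields` parameter are hypothetical and not part of the answer above, and it assumes every inner dict contains all the listed fields with hashable values.

```python
def dedupe_by_fields(records, fields=("Name", "Age")):
    """Return a new dict keeping the first record seen for each combination of `fields`."""
    seen = set()
    result = {}
    for key, record in records.items():
        # Build a hashable signature from the fields that define a duplicate.
        signature = tuple(record[f] for f in fields)
        if signature not in seen:
            seen.add(signature)
            result[key] = record
    return result


sample = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

deduped = dedupe_by_fields(sample)
print(list(deduped))  # ['0', '1', '2']
```

The inner dicts are not copied, so the result shares them with the input; for several million rows that avoids duplicating the records in memory.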

Vojtěch Chvojka
funnydman

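Two dictionary comprehensions: this is easier to write, but it will be slower than using a set. Note that it keeps the last occurrence of each (Name, Age) pair rather than the first, which the question allows.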

>>> {(v['Name'], v['Age']): k for k, v in my_dictionary.items()}
{('Nick', 39): '4', ('Steve', 19): '1', ('Dave', 23): '2'}
>>> {k: my_dictionary[k] for k in _.values()}
{'4': {'Name': 'Nick', 'Age': 39, 'Country': 'France'},
 '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
 '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
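The session above keeps whichever key was written last for each pair ('4' for Nick, 39), and `_` is only available in the interactive interpreter. If the first occurrence is preferred, or the code has to run in a script, one possible variant (a sketch, not the answer's own method; the name `first_key` is made up here) uses `dict.setdefault`, which only stores a key when the pair is not already present:

```python
my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

# Map each (Name, Age) pair to the key of its FIRST occurrence.
first_key = {}
for k, v in my_dictionary.items():
    # setdefault leaves an existing entry untouched, so later duplicates are ignored.
    first_key.setdefault((v["Name"], v["Age"]), k)

# Rebuild the outer dictionary from the surviving keys.
result = {k: my_dictionary[k] for k in first_key.values()}
print(list(result))  # ['0', '1', '2']
```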
Mechanic Pig