3

I have a list of dictionaries sorted by a specific key. Each dictionary contains 32 elements, and there over 4000 dictionaries in the list. I need code to work through the list and return a new list with all duplicates removed.

The methods from these links:

Don't help me because a dictionary is unhashable.

Any thoughts? If you need more info, comment and I will add the info.

Edit:

A duplicate dictionary would be any two dictionaries with the same values for list[dictionary][key].


Ok, here is a detailed explanation for those who need it.

I have a list of dictionaries like thus:

[ {
    "ID" : "0001",
    "Organization" : "SolarUSA",
    "Matchcode" : "SolarUSA, Something Street, Somewhere State, Whatev Zip",
    "Owner" : "Timothy Black",
   }, {
    "ID" : "0002",
    "Organization" : "SolarUSA",
    "Matchcode" : "SolarUSA, Something Street, Somewhere State, Whatev Zip",
    "Owner" : "Johen Wilheim",
   }, {
    "ID" : "0003",
    "Organization" : "Zapotec",
    "Matchcode" : "Zapotec, Something Street, Somewhere State, Whatev Zip",
    "Owner" : "Simeon Yurrigan",
   } ]

Of this list, the first and second dictionaries are duplicates because their Matchcodes are identical.

Now this list is sorted by the following code:

# sort_by is "Matchcode"
def sort( list_to_be_sorted, sort_by ):
    return sorted(list_to_be_sorted, key=lambda k: k[sort_by])

So I have a neat list of dictionaries sorted by Matchcode. Now I just need to iterate over the list, accessing the list[dictionary][key] and deleting duplicates when two key values match.

Community
  • 1
  • 1
Jacob Bridges
  • 725
  • 2
  • 8
  • 16
  • you need to define what constitutes a "duplicate" dictionary since they are not hashable. Same keys? same keys and values? Same keys AND those keys mapped to the same values? – Cameron Sparr Aug 28 '13 at 20:55
  • @CameronSparr I edited my question to address your concern. Thanks – Jacob Bridges Aug 28 '13 at 21:02
  • @Jacob-IT, I am not positive I understand you. Are you saying that if `key = 3` and your list had both `{3: 4, 5:6, 7:None}` and `{3:4, 5:8, 9:None, 10:1000}` in it, they would be duplicates? – Mike Graham Aug 28 '13 at 21:06
  • 3
    What does "any dictionary with the same value as `key`" mean? You mean the same value for every key? The same value at some `key` that you've specified externally? – abarnert Aug 28 '13 at 21:07
  • 2
    It looks like you are using dictionaries as as a poor man's object. (Nothing wrong with that.) This problem may be a hint that you should create a class for this data. Then you can define `__eq__`, `__hash__` and those other answers wold apply. – Steven Rumbalski Aug 28 '13 at 21:07
  • @Jacob-IT Still not sure what you mean, do you mean that all the dictionaries have `key`, and if two dictionaries have the same value for `key`, then they are duplicate? – Cameron Sparr Aug 28 '13 at 21:07
  • also, try to refrain from extra info in your question posts, is the fact that it's sorted relevant to this question? Does the sorting need to be maintained? – Cameron Sparr Aug 28 '13 at 21:10
  • 2
    Can you post an example of, say, three of these dictionaries, two of which should be considered duplicates? It's still very unclear what the criteria are for dictionaries to be considered duplicates. – Henry Keiter Aug 28 '13 at 21:12
  • 1
    @MikeGraham: Yes, in that case they would be duplicates because the values of the key ("3" in this case) are identical. – Jacob Bridges Aug 28 '13 at 21:29
  • An alternative to preserving order would be to sort your result rather than your input. I'm not sure what the performance implications of that would be. – Steven Rumbalski Aug 28 '13 at 21:29
  • @abarnert I have updated my question, but in short `key` is specified externally and when two dictionaries' `key` have the same value, one should be deleted from the list. – Jacob Bridges Aug 28 '13 at 21:30
  • @StevenRumbalski I am not sure how to use classes. From what I have gleaned they are incredibly useful, but from starting Python two weeks ago I have not made it that far yet. If you could create a template of some sort I would appreciate it. – Jacob Bridges Aug 28 '13 at 21:31
  • @CameronSparr yes, exactly. – Jacob Bridges Aug 28 '13 at 21:32
  • @HenryKeiter I have update my question to show one dictionary. Do you need three? That would be a lot of text. – Jacob Bridges Aug 28 '13 at 21:33
  • 2
    @Jacob-IT: That's exactly why we ask for a [SSCCE](http://sscce.org). Clearly the fact that the dictionaries have 32 keys isn't relevant to your problem; it would be the same problem if they only had 3 keys, right? So, give us a version stripped down to 3 dictionaries with 3 keys, and it's much less text than what you have now, and much clearer. – abarnert Aug 28 '13 at 21:47
  • Oh my gosh people, don't hate. I really am trying to explain this clearly. Now, @abarnert I updated the question to match your SSCCE. – Jacob Bridges Aug 28 '13 at 22:01
  • @Jacob-IT: What you've given us now doesn't even compile, because of at least one typo. You've wasted an hour of your time with dozens of comments and edits, and are apparently getting annoyed with the process, but that's not anyone else's fault. If you give us code and data that demonstrates your problem and can be debugged, we can solve it quickly; if you don't, it will be a frustrating experience for you, because you're effectively putting yourself into the middle of everyone else's edit-debug cycle. – abarnert Aug 28 '13 at 22:11
  • @abarnert: Thank you for your answer but your comments are a mixture of questions and nonconstructive blabber. Please keep your comments a simple request so that others (and myself) can clearly see the edit trail. – Jacob Bridges Aug 28 '13 at 22:16
  • @Jacob-IT: I'm assuming you didn't already know that we want an SSCCE, and why, and what it means (because the only other possibility is that you were being deliberately obstructive, which hardly seems likely). So explaining that to you is not "nonconstructive blabber". Especially after people had asked you half a dozen times for enough information without getting it before the explanation. Again, I understand that you're frustrated by the process, but you're the one making it hard, not everyone who's trying to help you. – abarnert Aug 28 '13 at 22:20

8 Answers8

11

Just as you can use a tuple to get a hashable equivalent to a list, you can use a frozenset to get a hashable equivalent to a dict. The only trick is that you need to pass d.items() rather than d to the constructor.

>>> d = {'a': 1, 'b': 2}
>>> s = frozenset(d.items())
>>> hash(s)
-7588994739874264648
>>> dict(s) == d
True

And then you can use your favorite of the solutions you've already seen. Dump them into a set, or use an OrderedSet or the unique_everseen recipe if you need to preserve order, etc. For example:

>>> unique_sets = set(frozenset(d.items()) for d in list_of_dicts)
>>> unique_dicts = [dict(s) for s in unique_sets]

Or, preserving order and using a key value:

>>> sets = (frozenset(d.items()) for d in list_of_dicts)
>>> unique_sets = unique_everseen(sets, key=operator.itemgetter(key))
>>> unique_dicts = [dict(s) for s in unique_sets]

Of course if you have lists or dicts nested within, you will have to convert recursively, just as you would for a list of lists.

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • unique_everseen expects key to be a function. >>> unique_sets = unique_everseen(sets, key=operator.itemgetter) – Ullullu Jun 06 '17 at 19:15
6

Use itertools.groupby() to group your dictionaries by key value, then take the first item from each group.

import itertools

data =[ {
    "ID" : "0001",
    "Organization" : "SolarUSA",
    "Matchcode" : "SolarUSA, Something Street, Somewhere State, Whatev Zip",
    "Owner" : "Timothy Black",
   }, {
    "ID" : "0002",
    "Organization" : "SolarUSA",
    "Matchcode" : "SolarUSA, Something Street, Somewhere State, Whatev Zip",
    "Owner" : "Johen Wilheim",
   }, {
    "ID" : "0003",
    "Organization" : "Zapotec",
    "Matchcode" : "Zapotec, Something Street, Somewhere State, Whatev Zip",
    "Owner" : "Simeon Yurrigan",
   } ]


print [g.next() for k,g in itertools.groupby(data, lambda x: x['Matchcode'])]

Gives the result

[{'Owner': 'Timothy Black',  
  'Organization': 'SolarUSA', 
  'ID': '0001',  
  'Matchcode': 'SolarUSA, Something Street, Somewhere State, Whatev Zip'},

 {'Owner': 'Simeon Yurrigan', 
  'Organization': 'Zapotec', 
  'ID': '0003', 
  'Matchcode':'Zapotec, Something Street, Somewhere State, Whatev Zip'}]

I believe this is what you're looking for.

EDIT: I like the unique_justseen solution better. It's shorter and more descriptive.

Marc Poulin
  • 81
  • 2
  • 2
  • If someone getting an Attribute error just change >>> print [g.next() for k,g in itertools.groupby(data, lambda x: x['Matchcode'])] to this >>> print [next(g) for k,g in itertools.groupby(data, lambda x: x['Matchcode'])] – The Mask Oct 31 '20 at 09:08
4

This answer is not correct for the now-disambiguated problem.


Community
  • 1
  • 1
Mike Graham
  • 73,987
  • 14
  • 101
  • 130
  • But doesn't this code return a list of keys from within the dicts? I need a script to return a list of dicts, only with the duplicates missing. – Jacob Bridges Aug 28 '13 at 21:06
  • @Jacob-IT, unique_values is an iterable of dicts that have been uniquified (where "duplicates" were taken to be dicts that were equal). – Mike Graham Aug 28 '13 at 21:08
  • @StevenRumbalski, Can you explain what the heck you're talking about? – Mike Graham Aug 28 '13 at 21:15
  • @StevenRumbalski, I don't think you've understood what the posted code does at all. `unique_everseen` returns an iterable of dicts given an iterable of dicts (`list_of_dicts` being OP's list of dicts). I don't know why you think it will return "just the keys" or "a tuple of dictionary values" or anything else. – Mike Graham Aug 28 '13 at 21:30
  • To get it as a list in particular, one would just do `list(unique_everseen(list_of_dicts, key=whatever))`. – Mike Graham Aug 28 '13 at 21:31
  • 2
    Your `as_values` function is just `operator.itemgetter(*the_keys)`. – abarnert Aug 28 '13 at 21:32
  • Are you sure this is no longer correct? I think `list_of_dicts = list(unique_everseen(list_of_dicts, key=itemgetter(key)))` is what "Now I just need to iterate over the list, accessing the list[dictionary][key] and deleting duplicates when two key values match" is asking for. Am I still misreading the OP? – abarnert Aug 28 '13 at 21:44
  • @abarnert, That's not the same answer, that's a different answer. – Mike Graham Aug 28 '13 at 22:07
  • @MikeGraham: But the only difference is that the `key` function uses a _single_ key, rather than a _pair_ of keys. I've written a separate answer for that, but really, I think yours is already correct and should be accepted. – abarnert Aug 28 '13 at 22:12
1

So I have a neat list of dictionaries sorted by Matchcode. Now I just need to iterate over the list, accessing the list[dictionary][key] and deleting duplicates when two key values match.

I'm still not entirely sure what this means. It sounds like you're saying they will always be sorted by the same key you want to use to uniquify. If so, you can just use unique_justseen from the itertools recipes, using the same key function you used in the sort, such as itemgetter(key).

Using the sample list_of_dicts from your edited question:

>>> list(unique_justseen(list_of_dicts, key=itemgetter('Matchcode')))
[{'ID': '0001',
  'Matchcode': 'SolarUSA, Something Street, Somewhere State, Whatev Zip',
  'Organization': 'SolarUSA',
  'Owner': 'Timothy Black'},
 {'ID': '0003',
  'Matchcode': 'Zapotec, Something Street, Somewhere State, Whatev Zip',
  'Organization': 'Zapotec',
  'Owner': 'Simeon Yurrigan'}]

If they're sorted by a different key to the one we're uniquifying on, then the fact that they're sorted isn't relevant at all, and unique_justseen won't work:

>>> list_of_dicts.sort(key=itemgetter('Owner'))
>>> list(unique_justseen(list_of_dicts, key=itemgetter('Matchcode')))
[{'ID': '0002',
  'Matchcode': 'SolarUSA, Something Street, Somewhere State, Whatev Zip',
  'Organization': 'SolarUSA',
  'Owner': 'Johen Wilheim'},
 {'ID': '0003',
  'Matchcode': 'Zapotec, Something Street, Somewhere State, Whatev Zip',
  'Organization': 'Zapotec',
  'Owner': 'Simeon Yurrigan'},
 {'ID': '0001',
  'Matchcode': 'SolarUSA, Something Street, Somewhere State, Whatev Zip',
  'Organization': 'SolarUSA',
  'Owner': 'Timothy Black'}]

But then you just have to use the unique_everseen recipe instead:

>>> list_of_dicts.sort(key=itemgetter('Owner'))
>>> list(unique_everseen(list_of_dicts, key=itemgetter('Matchcode')))
[{'ID': '0002',
  'Matchcode': 'SolarUSA, Something Street, Somewhere State, Whatev Zip',
  'Organization': 'SolarUSA',
  'Owner': 'Johen Wilheim'},
 {'ID': '0003',
  'Matchcode': 'Zapotec, Something Street, Somewhere State, Whatev Zip',
  'Organization': 'Zapotec',
  'Owner': 'Simeon Yurrigan'}]

(Of course this time we've got 0002 instead of 0001, because after sorting on Owner it's now the first value for its Matchcode instead of the second.)


The fact that dictionaries aren't hashable isn't relevant here, because the recipes are just storing the result of the key function in their sets, so as long as the values stored at key key are hashable, everything is fine.

abarnert
  • 354,177
  • 51
  • 601
  • 671
1

Now that we can see that two dictionaries are duplicates if a particular key matches, the problem is very simple. Just iterate over the dictionaries; keep track of the keys you've seen, and finally make a new list out of the unique ones.

import collections
def get_unique_items(list_of_dicts, key="Matchcode"):
    # Count how many times each key occurs.
    key_count = collections.defaultdict(lambda: 0)
    for d in list_of_dicts:
        key_count[d[key]] += 1

    # Now return a list of only those dicts with a unique key.
    return [d for d in list_of_dicts if key_count[d[key]] == 1]

Note that I use a defaultdict here to count occurrences of each key (there are other ways to do this but I think this is the cleanest, personally). The reason I didn't use a set to track "visited" keys is that you'll get one copy of every key in the list, including the duplicated ones. That means you would have to keep a second set, to track the keys that are truly duplicated (those that are already in the "visited" keys when you encounter them), so you don't include them.

On the other hand, if what you want is just to take the first dictionary you see with each given key, whether there's a duplicate or not later on, a set approach would work fine, like in Mike Graham's second answer.

Community
  • 1
  • 1
Henry Keiter
  • 16,863
  • 7
  • 51
  • 80
0
seen_values = set()
without_duplicates = []
for d in list_of_dicts:
    value = d[key]
    if value not in seen_values:
        without_duplicates.append(d)
        seen_values.add(value)
Mike Graham
  • 73,987
  • 14
  • 101
  • 130
0

Basically you need something like no_dup(checked_val_extrator, list) where no_dup may look like:

def no_dup(extractor, lst):
   "keeps only first elements encountered for any particular extracted value using =="
   known = set()
   res = []
   for item in lst:
     if extractor(item) in known: continue
     known.add(extractor(item))
     res.append(item)
   return res
-1

I'm not 100% clear on exactly what you're trying to achieve, but:

Remove all duplicate dictionary entries

So long as you don't mind combining all the dictionaries,

import itertools
dict(itertools.chain(*map(lambda x: x.items(), list_of_dictionaries)))
jdotjdot
  • 16,134
  • 13
  • 66
  • 118
  • 1
    (In general, you can count on the fact that if your code uses `eval`, you're doin' it wrong. ;) ) – Mike Graham Aug 28 '13 at 21:11
  • Thanks for your example Mike--I stand corrected. As far as `eval`, I don't fully agree with you. `eval` is terrible for production and not generally good practice, but for isolated internal one-off data processing hacks, it often is fine. – jdotjdot Aug 28 '13 at 21:21