
How can I check and remove duplicate values from items in a dictionary? I have a large data set, so I'm looking for an efficient method. The following is an example of values in a dictionary that contains a duplicate:

'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]

needs to become

'word': [('769817', [6]), ('769819', [4, 10])]
– Brutalized

7 Answers


This problem essentially boils down to removing duplicates from a list of unhashable types, for which converting to a set is not possible.

One possible method is to build up a new list for each value, checking each item for membership before appending it.

d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
for k, v in d.items():
    new_list = []
    for item in v:
        # linear membership check against new_list -> O(n^2) overall
        if item not in new_list:
            new_list.append(item)
    d[k] = new_list

Alternatively, use groupby() for a more concise answer, although it is potentially slower: the list must be sorted first, but if it is already sorted, groupby() is faster than doing membership checks.

import itertools

d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
for k, v in d.items():
    v.sort()  # groupby() only collapses adjacent duplicates, so sort first
    d[k] = [item for item, _ in itertools.groupby(v)]

Output -> {'word': [('769817', [6]), ('769819', [4, 10])]}
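As the comments below suggest, it is worth benchmarking the two approaches against your own data. A minimal timeit sketch; the data shape and sizes here are illustrative assumptions, not measurements from this thread:

import timeit

setup = """
import itertools
v = [(str(i), [i % 5]) for i in range(200)] * 2  # hypothetical data with duplicates
"""

membership_check = """
new_list = []
for item in v:
    if item not in new_list:
        new_list.append(item)
"""

grouped = "[item for item, _ in itertools.groupby(sorted(v))]"

print(timeit.timeit(membership_check, setup=setup, number=100))
print(timeit.timeit(grouped, setup=setup, number=100))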

– ospahiu
  • Added another approach, I would explore bench-marking the two approaches to see what works better with your dataset. – ospahiu Sep 14 '16 at 17:50
  • @mrdomoboto: Why `list()` wrapping genexpr? `[item for item, _ in itertools.groupby(v)]` is going to be faster/more succinct than a genexpr that just gets passed to `list` constructor anyway. Also, `groupby` is only slower if `list` is unsorted; if it's already sorted, as in the example, no sorting needed, so it's actually faster than any other solution in theory (`O(n)` with no meaningful constant factors). It could be optimized further if needed to push all work to C layer by changing comprehension to `list(map(operator.itemgetter(0), itertools.groupby(v)))` (`list()` wrapping is Py3 only). – ShadowRanger Sep 14 '16 at 18:00
  • You are correct, the `list()` wrapper does add another order of magnitude cost to that line. I've updated my answer. Yes, `groupby()` is far faster if the list is already sorted, which is why I added the note. – ospahiu Sep 14 '16 at 18:15
your_list = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
new = []
for x in your_list:
    if x not in new:
        new.append(x)

print(new)
# Output -> [('769817', [6]), ('769819', [4, 10])]
– levi
  • That is a terrible misuse of list comprehensions. Don't use a `list` comprehension for side-effects. They're a functional construct, and functional code is side-effect free; using them for side-effects is wasteful (creates a garbage `list` of `None`s that just gets thrown away after) and confusing. Just use a plain `for` loop for those cases. Also, never name a variable `list`; shadowing built-ins is terrible practice. – ShadowRanger Sep 14 '16 at 17:43
  • @ShadowRanger I fixed it. – levi Sep 14 '16 at 17:47
  • You did, but the new code has its own issues in the general case (no issues for the two-`tuple` case admittedly), since falsy values will be mistaken for duplicates by that test (a `list` like `[1, None, 0, 2]`, which has no duplicates, would still get reduced to `[1, 2]` because `None` and `0` look like duplicates to this logic). You could just do `if x not in new: new.append(x)` and avoid the `to_append` temporary and the risk of confusing falsy values for duplicate values. – ShadowRanger Sep 14 '16 at 17:58

You have a list, not a dictionary. Python dictionaries may have only one value for each key. Try

my_dict = dict([('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])])

result:

{'769817': [6], '769819': [4, 10]}

a Python dictionary. For more information, see https://docs.python.org/3/tutorial/datastructures.html#dictionaries

– Patrick Haugh

(This first part applied to the original question before it was edited; left for posterity.) You're not using a dict at all, just a list of two-tuples, where the second element in each tuple is itself a list. If you actually want a dict,

dict([('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])])

will convert it and uniquify by key, so you'd end up with {'769817': [6], '769819': [4, 10]}. Note that it loses order and doesn't pay attention to whether the values (the sub-lists) are unique; it just keeps the last pairing for a given key.

If you need to uniquify adjacent duplicates (where the values are important to uniqueness) while preserving order, and don't want/need a real dict, use itertools.groupby:

import itertools
nonuniq = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
uniq = [k for k, g in itertools.groupby(nonuniq)]
# uniq is [('769817', [6]), ('769819', [4, 10])]
# but it wouldn't work if the input was
# [('769819', [4, 10]), ('769817', [6]), ('769819', [4, 10])]
# because the duplicates aren't adjacent

If you need to collapse non-adjacent duplicates, and don't need to preserve order (or sorted order is fine), you can use groupby to get an O(n log n) solution. That beats naive solutions that build a new list and avoid duplicates by checking for presence in that list, which are O(n^2), and it avoids the conversion step of set-based solutions, which would be O(n) but require you to convert the sub-lists in your data to tuples to make them hashable (a sketch of that approach follows the code below):

# Only difference is sorting nonuniq before grouping
uniq = [k for k, g in itertools.groupby(sorted(nonuniq))]
# uniq is [('769817', [6]), ('769819', [4, 10])]
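
For completeness, a sketch of the set-based approach mentioned above; converting each sub-list to a tuple (an assumption that fits this data shape) is what makes the items hashable:

nonuniq = [('769819', [4, 10]), ('769817', [6]), ('769819', [4, 10])]
seen = set()
uniq = []
for key, sublist in nonuniq:
    marker = (key, tuple(sublist))  # tuples are hashable, lists are not
    if marker not in seen:
        seen.add(marker)
        uniq.append((key, sublist))
# uniq is [('769819', [4, 10]), ('769817', [6])]: one O(n) pass, order preserved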
– ShadowRanger
  • I am using a dictionary. I have multiple tuples for 1 key. – Brutalized Sep 14 '16 at 17:41
  • @Brutalized: Your original question didn't make that clear. In any event, the last two options are more optimized versions of the current accepted answer (if the data is already ordered, the non-sorting `groupby` is `O(n)`, if it needs sorting, it's `O(n log n)`, both of which are a major improvement on `O(n^2)` of the loop with `list` contain checking on every iteration). – ShadowRanger Sep 14 '16 at 17:50

How about this? I am just focusing on the list part:

>>> s = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
>>> [(x,y) for x,y in {key: value for (key, value) in s}.items()]
[('769817', [6]), ('769819', [4, 10])]
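
As the comment below points out, this can be simplified. A short sketch, assuming Python 3:

s = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
uniq = list(dict(s).items())  # dict() keeps only the last value for each duplicate key
# uniq is [('769817', [6]), ('769819', [4, 10])]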
– James Sapam
  • That list comprehension could simplify to just: `dict(s).items()` (`list(dict(s).items())` on Py3). No need to turn everything into a comprehension when the data is already what the constructors expect. It also only uniqifies by the first element in the `tuple`, not the second, and does not preserve order, which may not match requirements. It's efficient if that's acceptable though. – ShadowRanger Sep 14 '16 at 17:55
  • That's very thoughtful and much more efficient. – James Sapam Sep 14 '16 at 18:53

You can uniquify the items based on a hash they generate. The hash could be anything: a json.dumps with sorted keys, or a pickle.dumps. This one-liner can uniquify your dict as required (shown here in Python 3 syntax):

>>> d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
>>> import json
>>> # each item serializes to a canonical string, which serves as a deduping dict key
>>> {k: list({json.dumps(x, sort_keys=True): x for x in v}.values()) for k, v in d.items()}
{'word': [('769817', [6]), ('769819', [4, 10])]}
– DhruvPathak

How about this? Note that it deduplicates whole values across keys, and requires the values to be hashable:

def remove_duplicates(d: dict):
    unique_values = set(d.values())  # requires hashable values
    o = {}
    for k, v in d.items():
        if v in unique_values:
            o[k] = v
            unique_values.remove(v)  # later keys with the same value are skipped
    return o
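
A quick usage sketch; the input dict here is hypothetical, with hashable values, since set(d.values()) would raise TypeError on list values like those in the question:

d = {'a': 1, 'b': 2, 'c': 1}  # hypothetical input; values must be hashable
print(remove_duplicates(d))
# Output -> {'a': 1, 'b': 2} (the duplicate value under 'c' is dropped)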
– Philippe Remy