I have a list of dictionaries in Python, which looks like the following:

d = [{'feature_a':1, 'feature_b':'Jul', 'feature_c':100}, {'feature_a':2, 'feature_b':'Jul', 'feature_c':150}, {'feature_a':1, 'feature_b':'Mar', 'feature_c':110}, ...]

What I want is for each combination of feature_a, feature_b and feature_c to be unique.

For example, if 3 entries have the same feature_a and feature_b, but their feature_c values are 100, 100 and 150, then after the operation only the entries with 100 and 150 should remain.

How can I achieve this?

UPDATE:

OK, thanks to Anand's excellent answer, that works perfectly. However, I have a further question.

Suppose we have a new feature_d and the list looks like:

d = [{'feature_a':1, 'feature_b':'Jul', 'feature_c':100, 'feature_d':'A'}, {'feature_a':2, 'feature_b':'Jul', 'feature_c':150, 'feature_d':'B'}, {'feature_a':1, 'feature_b':'Mar', 'feature_c':110, 'feature_d':'F'}, ...]

and I only want to deduplicate on feature_a, _b and _c, leaving feature_d out of the comparison. How can I achieve this?

Many thanks.

ChangeMyName
  • It sounds like you are using the wrong layout. Why not have a dictionary where the keys are the features and the values are `set`s? – rlbond Aug 03 '15 at 16:50
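
One possible reading of the layout suggested in the comment above, as a rough sketch only: key a dictionary on the (feature_a, feature_b) pair and collect the feature_c values in a set, so repeated values collapse on insertion. The sample data and the name `values_by_features` are made up for illustration, and the commenter may have had a different layout in mind.

```python
from collections import defaultdict

# Hypothetical sample data mirroring the example in the question:
# three entries share feature_a and feature_b, with feature_c 100, 100, 150.
d = [
    {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100},
    {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100},
    {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 150},
    {'feature_a': 2, 'feature_b': 'Jul', 'feature_c': 150},
]

# Map each (feature_a, feature_b) pair to the set of feature_c values seen.
values_by_features = defaultdict(set)
for record in d:
    pair = (record['feature_a'], record['feature_b'])
    values_by_features[pair].add(record['feature_c'])

print(dict(values_by_features))
```

With this layout the duplicate 100 never gets stored twice, at the cost of no longer having one dictionary per entry.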

1 Answer

If the order of the initial list d is not important, you can take the .items() of each dictionary and convert it into a frozenset(), which is hashable (provided every value in the dictionary is itself hashable). You can then put those into a set() or frozenset() to drop the duplicates, and convert each inner frozenset() back into a dictionary. Example -

uniq_d = list(map(dict, frozenset(frozenset(i.items()) for i in d)))

Sets do not allow duplicate elements, so the duplicates are dropped automatically, but you lose the original order of the list. In Python 2.x the list(...) call is not needed, as map() already returns a list.


Example/Demo -

>>> import pprint
>>> pprint.pprint(d)
[{'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100},
 {'feature_a': 2, 'feature_b': 'Jul', 'feature_c': 150},
 {'feature_a': 1, 'feature_b': 'Mar', 'feature_c': 110},
 {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100},
 {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 150}]
>>> uniq_d = list(map(dict, frozenset(frozenset(i.items()) for i in d)))
>>> pprint.pprint(uniq_d)
[{'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100},
 {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 150},
 {'feature_a': 1, 'feature_b': 'Mar', 'feature_c': 110},
 {'feature_a': 2, 'feature_b': 'Jul', 'feature_c': 150}]
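
If the original order does matter, a variation on the same idea is to keep a set of the frozensets already seen and append only first occurrences to a new list. This is a minimal sketch rather than part of the one-liner above; the function name `dedupe_dicts` is made up for illustration, and it assumes every value in the dictionaries is hashable.

```python
def dedupe_dicts(records):
    """Return records without exact duplicates, keeping first occurrences in order."""
    seen = set()
    unique = []
    for record in records:
        # A frozenset of (key, value) pairs is hashable only if every value is.
        fingerprint = frozenset(record.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


d = [
    {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100},
    {'feature_a': 2, 'feature_b': 'Jul', 'feature_c': 150},
    {'feature_a': 1, 'feature_b': 'Mar', 'feature_c': 110},
    {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100},
    {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 150},
]
uniq_d = dedupe_dicts(d)
```

Here only the fourth entry is dropped, since it repeats the first one exactly, and the remaining entries stay in their input order.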

For the new requirement -

However, what if I have another feature_d, but I only want to dedup on feature_a, _b and _c?

If two entries have the same feature_a, _b and _c, they are considered duplicates, no matter what is in feature_d.

A simple way to do this is to keep a set of the keys already seen alongside a new list: build the key from only the features you care about, and append an entry to the new list only if its key is not yet in the set. Example -

seen_set = set()
new_d = []
for i in d:
    # Build the deduplication key from only the features that matter.
    key = (i['feature_a'], i['feature_b'], i['feature_c'])
    if key not in seen_set:
        new_d.append(i)
        seen_set.add(key)

Example/Demo -

>>> d = [{'feature_a':1, 'feature_b':'Jul', 'feature_c':100, 'feature_d':'A'},
...  {'feature_a':2, 'feature_b':'Jul', 'feature_c':150, 'feature_d': 'B'},
...  {'feature_a':1, 'feature_b':'Mar', 'feature_c':110, 'feature_d':'F'},
...  {'feature_a':1, 'feature_b':'Mar', 'feature_c':110, 'feature_d':'G'}]
>>> seen_set = set()
>>> new_d = []
>>> for i in d:
...     if tuple([i['feature_a'],i['feature_b'],i['feature_c']]) not in seen_set:
...         new_d.append(i)
...         seen_set.add(tuple([i['feature_a'],i['feature_b'],i['feature_c']]))
...
>>> pprint.pprint(new_d)
[{'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100, 'feature_d': 'A'},
 {'feature_a': 2, 'feature_b': 'Jul', 'feature_c': 150, 'feature_d': 'B'},
 {'feature_a': 1, 'feature_b': 'Mar', 'feature_c': 110, 'feature_d': 'F'}]
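
The same loop can be wrapped in a small helper that takes the list of features to compare on, which may be handy if the set of features changes again. This is only a sketch: the name `dedupe_by` and its `keys` parameter are hypothetical, and it assumes every record contains all the listed keys with hashable values.

```python
def dedupe_by(records, keys):
    """Keep the first record for each distinct combination of the given keys."""
    seen = set()
    result = []
    for record in records:
        # Only the listed features take part in the comparison.
        key = tuple(record[k] for k in keys)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


d = [
    {'feature_a': 1, 'feature_b': 'Jul', 'feature_c': 100, 'feature_d': 'A'},
    {'feature_a': 2, 'feature_b': 'Jul', 'feature_c': 150, 'feature_d': 'B'},
    {'feature_a': 1, 'feature_b': 'Mar', 'feature_c': 110, 'feature_d': 'F'},
    {'feature_a': 1, 'feature_b': 'Mar', 'feature_c': 110, 'feature_d': 'G'},
]
new_d = dedupe_by(d, ('feature_a', 'feature_b', 'feature_c'))
```

As in the demo above, the entry with feature_d 'G' is dropped because its first three features match the entry with 'F', which appeared first.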
Anand S Kumar
  • Hi, Anand, thanks for your answer, it works perfectly. However, what if I have another `feature_d` but I only want to dedup on `feature_a`, `_b` and `_c`? Many thanks. – ChangeMyName Aug 04 '15 at 09:24
  • Can you update the question with a sample dictionary for that? – Anand S Kumar Aug 04 '15 at 09:27
  • Assume two elements have the same `feature_a`, `_b` and `_c` but a different `_d`. What happens then? Are they considered duplicates of each other? – Anand S Kumar Aug 04 '15 at 09:38
  • Yes, if two entries have the same `feature_a`, `_b` and `_c`, they are considered duplicates, no matter what is in `feature_d`. – ChangeMyName Aug 04 '15 at 09:45
  • Thanks a lot, Anand. Excellent answer! – ChangeMyName Aug 04 '15 at 11:20
  • @AnandSKumar wouldn't a `not in` check basically turn a list into a set? So wouldn't `not in seen_set` be kind of redundant, because `seen_set` is already a `set` and can't have duplicates? Why not use a `list` instead of a `set`? – Halcyon Abraham Ramirez Aug 05 '15 at 16:07
  • `not in` does not do any kind of conversion; where did you read that? Also, a containment check in a `set` is O(1) (constant time) on average, whereas a containment check in a `list` is O(n), so using a set is faster. And sets cannot hold duplicates, though that property is not really used in this context. – Anand S Kumar Aug 05 '15 at 16:09
  • What I meant by `turn a list basically into a set` is that if you guard every append to a list with `not in`, the list can't contain duplicates and so behaves basically like a `set`, without actually being converted into one. So sets are faster than lists? – Halcyon Abraham Ramirez Aug 05 '15 at 16:29
  • Yes, sets are very much faster than lists for this; try it out yourself and you will see. Create a list of 100000 items and a set of 100000 items and try `elem in list` and `elem in set`. – Anand S Kumar Aug 05 '15 at 16:30
  • @AnandSKumar well ok that would make using sets better then. thanks Anand :D – Halcyon Abraham Ramirez Aug 05 '15 at 16:58
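
The set-versus-list comparison suggested in the comments can be tried with a small timing sketch like the one below. The variable names and the repeat count of 100 are arbitrary choices for illustration, and the absolute timings will vary from machine to machine; only the relative difference is of interest.

```python
import timeit

n = 100000
items_list = list(range(n))
items_set = set(items_list)
target = n - 1  # last element: the list has to scan every item to find it

# Time 100 membership checks against each container.
list_time = timeit.timeit(lambda: target in items_list, number=100)
set_time = timeit.timeit(lambda: target in items_set, number=100)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

The list check walks the elements one by one, while the set check is a hash lookup, which is why `seen_set` in the answer is a set rather than a list.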