I have a Pandas dataframe (truncated) that looks like the following:

                           Name                              Identifiers
419                         Foo  {'VendorName': 'A', 'ProductName': 'Z'}
420                         Foo  {'VendorName': 'A', 'ProductName': 'Z'}
421                         Foo  {'VendorName': 'A', 'ProductName': 'Y'}
422                         Foo  {'VendorName': 'B', 'ProductName': 'X'}
423                         Bar  {'VendorName': 'A', 'ProductName': 'Z'}

Using Pandas, I want to count the number of unique Name and Identifiers combinations. I'm having issues since Identifiers is a dictionary. All dictionaries will have the same keys, but the values will obviously differ. Given the above example, there are 4 unique combinations.

roundtheworld
  • Try using the frozenset method: `df["Identifiers"] = df["Identifiers"].apply(lambda x: frozenset(x.items())); g = df.groupby(["Name", "Identifiers"]); len(g)` then `df.Identifiers.apply(dict)` to get the dictionaries back. – hellpanderr Jul 26 '19 at 21:03
    @Prune I don't think this is a duplicate, in fact it is quite the opposite to the linked question: there they want to get a *list of duplicate items*, here the OP wants to get the *number of unique items*. – Stef Jul 26 '19 at 22:15
  • Inverting the filter logic solves the problem; no? – Prune Jul 26 '19 at 22:29
  • @Prune in principle yes, but the accepted answer is very old, dating back to a time when the keep=False keyword argument wasn't available, and hence is overly complicated now; besides, none of the answers will work for dataframes with dictionary series as in this question. – Stef Jul 27 '19 at 07:21
  • Good points. I'll reopen, and we'll get better, updated responses. – Prune Jul 29 '19 at 15:48
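hellpanderr's frozenset suggestion from the comment above, written out as a runnable sketch against the example data (the frame is rebuilt here so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "Foo", "Foo", "Foo", "Bar"],
    "Identifiers": [
        {"VendorName": "A", "ProductName": "Z"},
        {"VendorName": "A", "ProductName": "Z"},
        {"VendorName": "A", "ProductName": "Y"},
        {"VendorName": "B", "ProductName": "X"},
        {"VendorName": "A", "ProductName": "Z"},
    ],
})

# frozenset(x.items()) is hashable, so the column can serve as a groupby key
df["Identifiers"] = df["Identifiers"].apply(lambda x: frozenset(x.items()))
n_unique = len(df.groupby(["Name", "Identifiers"]))

# Restore the original dictionaries afterwards
df["Identifiers"] = df["Identifiers"].apply(dict)
print(n_unique)  # 4
```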

1 Answer

len(df.groupby(df.Name + df.Identifiers.astype(str)))

It doesn't play any role whether the Identifiers are dictionaries, strings or whatever - groupby just groups identical elements, no matter what type of elements they are.
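Applied to the example data from the question (rebuilt here so the snippet runs on its own), this one-liner returns the expected count:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Foo", "Foo", "Foo", "Foo", "Bar"],
    "Identifiers": [
        {"VendorName": "A", "ProductName": "Z"},
        {"VendorName": "A", "ProductName": "Z"},
        {"VendorName": "A", "ProductName": "Y"},
        {"VendorName": "B", "ProductName": "X"},
        {"VendorName": "A", "ProductName": "Z"},
    ],
})

# Each dict is rendered as a string, so Name + str(dict) is a hashable group key
print(len(df.groupby(df.Name + df.Identifiers.astype(str))))  # 4
```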

Update: ... provided that the objects are hashable. If they are not, as with e.g. dictionaries or lists, you'll have to cast them to some hashable type first. This can be for instance `str` (as shown above; no 100% guarantee, since the string depends on key insertion order, which dicts preserve from Python 3.6+ on) or, as hellpanderr proposed in his comment above, `frozenset`, e.g. `len(df1.groupby(lambda x: hash(df1.Name[x]) ^ hash(frozenset(df1.Identifiers[x].items()))))`
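The `str` caveat can be seen with two equal dicts built in a different key order: since dicts preserve insertion order, their string forms differ even though the dicts compare equal, so `astype(str)` would put them in different groups.

```python
d1 = {"VendorName": "A", "ProductName": "Z"}
d2 = {"ProductName": "Z", "VendorName": "A"}

print(d1 == d2)            # True: same keys and values
print(str(d1) == str(d2))  # False: str reflects insertion order
```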

Stef
  • pandas 0.23.4 throws an error `TypeError: unhashable type: 'dict'` – hellpanderr Jul 26 '19 at 20:55
  • @hellpanderr please see my updated answer, surely there is a more elegant way, but this quick and dirty solution works for me – Stef Jul 26 '19 at 21:07
  • @roundtheworld: Did this answer solve your issue? If so, please consider [accepting the answer](https://stackoverflow.com/help/someone-answers); if not, please comment. This keeps this site useful and helps others quickly find an answer to similar questions. Thanks. – Stef Aug 02 '19 at 07:19