I have data that looks like:

Identifier  Category1 Category2 Category3 Category4 Category5
1000           foo      bat       678         a.x       ld
1000           foo      bat       78          l.o       op
1000           coo      cat       678         p.o       kt
1001           coo      sat       89          a.x       hd
1001           foo      bat       78          l.o       op
1002           foo      bat       678         a.x       ld
1002           foo      bat       78          l.o       op
1002           coo      cat       678         p.o       kt

What I am trying to do is compare 1000 to 1001, to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. So, the approach I wanted to use was:

  1. First group all the identifier items into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (**Please note that I want the code to do this itself, as there are millions of rows, as opposed to me writing code to manually compare identifiers.**) I have tried using the groupby feature of pandas; it does the grouping part well, but then I do not know how to compare the groups.
  2. Compare each of the groups/sub-data frames.

One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity, etc.).
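
For reference, this is roughly the grouping step I have tried so far (a minimal sketch; it assumes the data has already been read into a DataFrame called df from a file named data.csv, which is just a placeholder name):

import pandas as pd

# placeholder file name; the real data set has millions of rows
df = pd.read_csv("data.csv")

# groupby splits the rows by Identifier, but I don't know how to compare the resulting groups
for identifier, rows in df.groupby("Identifier"):
    print(identifier)
    print(rows)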

Any help is appreciated, I am very new to Python. Thanks in advance!

S.k.S

2 Answers

You could do something like the following:

import pandas as pd

input_file = pd.read_csv("input.csv")
columns = ['Category1','Category2','Category3','Category4','Category5']

duplicate_entries = {}

for group in input_file.groupby('Identifier'):
    # transforming to tuples so that it can be used as keys on a dict
    lines = [tuple(y) for y in group[1].loc[:,columns].values.tolist()]    
    key = tuple(lines) 

    if key not in duplicate_entries:
        duplicate_entries[key] = []

    duplicate_entries[key].append(group[0])

The values of duplicate_entries will then hold the lists of Identifiers whose rows are identical:

list(duplicate_entries.values())
> [[1000, 1002], [1001]]

EDIT:

To get only the entries that have duplicates, you could have something like:

all_dup = [ids for ids in duplicate_entries.values() if len(ids) > 1]
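
With the sample data above, that should leave only the identifier lists that actually repeat:

all_dup
> [[1000, 1002]]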

Explaining the indices (sorry I didn't explain it before): iterating over df.groupby yields tuples in which the first entry ([0]) is the key of the group (in this case the Identifier) and the second entry ([1]) is a DataFrame containing the rows of that group. So group[1] gives us the rows that build the key, and group[0] gives us the Identifier of that group, which is why the code appends group[0] to duplicate_entries[key].
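
Written with tuple unpacking, the same loop may be a bit easier to read; this is just a sketch of the loop above, using the same input_file, columns and duplicate_entries variables:

# identifier corresponds to group[0], rows to group[1]
for identifier, rows in input_file.groupby('Identifier'):
    key = tuple(tuple(y) for y in rows.loc[:, columns].values.tolist())
    duplicate_entries.setdefault(key, []).append(identifier)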

Raquel Guimarães
  • Thank you so much for answering! This seems to be working. However, given that my actual data set is pretty huge, is there a way to see only those sets/tuples that have a collection of identifiers that are identical and not all sets? For example, what I want to see as output is: [[1000,1002]] instead of [[1000, 1002], [1001]]. Also, why do we say 0 in append(group[0]) and not 1? Thanks! – S.k.S Jun 11 '17 at 03:44
  • I'm glad it worked! I've edited the answer to add the 'only duplicates' part and an explanation on the indices. Hope it's clearer now – Raquel Guimarães Jun 11 '17 at 13:36
  • Thank you so much! The all_dup part is throwing an error sadly. The error message is: IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable `--NotebookApp.iopub_data_rate_limit`. Any idea what this is about? – S.k.S Jun 11 '17 at 18:23
  • I don't know.. Maybe the answer to this question can help you: https://stackoverflow.com/questions/43288550/iopub-data-rate-exceeded-when-viewing-image-in-jupyter-notebook – Raquel Guimarães Jun 11 '17 at 19:29

We can split the data into groups with groupby, then sort each group by all the columns except "Identifier" (so that equal groups are detected even when their rows are in a different order), and finally compare the groups:

Suppose that columns = ["Identifier", "Category1", "Category2", "Category3", "Category4", "Category5"]

We can do:

groups = []
pure_groups = []
for name, group in df.groupby("Identifier"):
    pure_groups += [group]
    # drop the Identifier column so only the category columns are compared
    g_idfless = group[group.columns.difference(["Identifier"])]
    # sort the rows and reset the index so that equal groups compare as equal
    groups += [g_idfless.sort_values(columns[1:]).reset_index(drop=True)]

And to compare them:

# compare every pair of groups and report whether their sorted rows are identical
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        id1 = str(pure_groups[i]["Identifier"].iloc[0])
        id2 = str(pure_groups[j]["Identifier"].iloc[0])
        print(id1 + " and " + id2 + " equal?: " + str(groups[i].equals(groups[j])))

#-->1000 and 1001 equal?: False
#-->1000 and 1002 equal?: True
#-->1001 and 1002 equal?: False

EDIT: Added code to print the identifiers of the groups that match
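
If only the matching pairs are needed (rather than one line per comparison), the same pairwise loop can collect them into a list; this is a small sketch reusing the groups and pure_groups lists built above:

matches = []
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        # keep only the pairs whose sorted rows are identical
        if groups[i].equals(groups[j]):
            id1 = pure_groups[i]["Identifier"].iloc[0]
            id2 = pure_groups[j]["Identifier"].iloc[0]
            matches.append((id1, id2))

print(matches)
#-->[(1000, 1002)]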

lukess
  • Thank you so much for answering! So this works, but owing to my huge data set the kernel seems to crash after the code iterates for a while. Also, when there is a true match, I would love to see which sets those are. – S.k.S Jun 11 '17 at 03:47