
I have a dataset of companies. Each company has a tax payer number, an address, a phone number and some other fields. Here is some pandas sample code I took from Roméo Després:

import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})
print(df)

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
3      D      3       x
4      E      4       y
5      A      5       x
6      B      0       t
7      C      0       z
8      F      6       u
9      E      3       v

I need to deduplicate the dataset by these fields: non-unique companies can be linked by just one of the fields. That is, a company is definitely unique in my list only if it has no match on ANY of the key fields. If a company shares a tax payer number with some other entity, and that entity shares an address with a third one, then all three records describe the same company. The expected output in terms of unique companies should be:

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
8      F      6       u

The expected output, together with the unique-company index for each duplicate, should look like:

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     1
2      C      2       z                     2
3      D      3       x                     0
4      E      4       y                     1
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     3

How can I filter out duplicates in this case using python/pandas?

The only algorithm that comes to my head is the following direct approach:

  1. Group the dataset by the first key, collecting the other keys as sets in the resulting dataset.
  2. Then iteratively walk over the sets of the 2nd key, adding new 2nd-key values to the grouped dataset for each value of the 1st key, over and over until nothing changes.
  3. Once there is nothing more to add, repeat the same process for the 3rd key.

This doesn't look very promising in terms of either performance or simplicity of coding; a rough sketch of the idea is shown below.
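
For reference, here is a minimal sketch of that direct approach, written as a union-find over row indices instead of repeated set merging (the helpers find and union are my own, not from any library):

import collections

import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

# One union-find slot per row; assumes the default RangeIndex.
parent = list(df.index)

def find(i):
    """Return the root of row i, compressing the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i, j):
    """Merge the sets containing rows i and j."""
    parent[find(i)] = find(j)

# Rows sharing a value in any key column belong to the same company.
for col in df.columns:
    for rows in df.groupby(col).indices.values():
        for other in rows[1:]:
            union(rows[0], other)

# Representative = smallest row index in each set.
groups = collections.defaultdict(list)
for i in df.index:
    groups[find(i)].append(i)
df["representative_index"] = pd.Series(
    {i: min(members) for members in groups.values() for i in members}
)

(Note that on this sample, fully transitive linking actually collapses everything except row 8 into one company; see the discussion in the comments on the answer below.)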

Any other ways for removing duplicates by one of several keys?

MiamiBeach
  • Please include a _small_ subset of your data as a __copyable__ piece of code that can be used for testing, as well as your expected output for the __provided__ data. See [MRE - Minimal, Reproducible, Example](https://stackoverflow.com/help/minimal-reproducible-example) and [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888). – Henry Ecker May 17 '21 at 21:11
  • Henry, done. Thank you! – MiamiBeach May 18 '21 at 06:28
  • @HenryEcker, just added expected outputs. – MiamiBeach May 18 '21 at 13:56
  • Just use the [drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) method on the dataframe. – Alexander Volkovsky May 18 '21 at 16:51

1 Answer


You could solve this using the graph analysis library networkx.

import itertools

import networkx as nx
import pandas as pd


df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

def iter_edges(df):
    """Yield all relationships between rows."""
    # Rows that share a value in any column are connected.
    for name in df.columns:  # df.iteritems() is deprecated (removed in pandas 2.0)
        for nodes in df.groupby(name).indices.values():
            yield from itertools.combinations(nodes, 2)

def iter_representatives(graph):
    """Yield all elements and their representative."""
    for component in nx.connected_components(graph):
        representative = min(component)
        for element in component:
            yield element, representative


graph = nx.Graph()
graph.add_nodes_from(df.index)
graph.add_edges_from(iter_edges(df))

df["representative_index"] = pd.Series(dict(iter_representatives(graph)))

In the end df looks like:

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     0
2      C      2       z                     0
3      D      3       x                     0
4      E      4       y                     0
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     0

Note you can use df.drop_duplicates("representative_index") to obtain the unique rows:

  tax_id  phone address  representative_index
0      A      0       x                     0
8      F      6       u                     8
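
For what it's worth, the same connected-components computation can be done without networkx, for example with scipy.sparse.csgraph. A minimal sketch, assuming scipy is installed:

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

n = len(df)
rows, cols = [], []
for col in df.columns:
    # Link every row in a group of equal values to the group's first row.
    for nodes in df.groupby(col).indices.values():
        rows.extend([nodes[0]] * (len(nodes) - 1))
        cols.extend(nodes[1:])

adjacency = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
_, labels = connected_components(adjacency, directed=False)

# Representative = smallest row index within each component.
df["representative_index"] = pd.Series(df.index).groupby(labels).transform("min")

This produces the same representative_index column as the networkx version while avoiding an explicit Python-level graph.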
Roméo Després
  • Romeo, great solution! – MiamiBeach May 18 '21 at 05:47
  • Romeo, do you think there is a way to set a unique "representative" for each duplicate, meaning: for row 5 it is row 0 (by tax id), for row 3 it is again row 0 (by address)? – MiamiBeach May 18 '21 at 10:00
  • Sure, just edited the answer with this. You should probably edit your question and add the two expected outputs (`unique_companies` and `df["representative_index"]` in my examples). – Roméo Després May 18 '21 at 13:30
  • thanks a lot! I have edited the question, I hope they will reopen it. – MiamiBeach May 18 '21 at 13:57
  • Romeo, looks like there is a flaw in the last part. See row No 9: it has representative_index=3, but row 3 is not unique itself. Row No 9 should have representative_index = 0. Row 9 is a tough case where a row is connected with "the root" via another row. – MiamiBeach May 18 '21 at 14:26
  • 100% true, sorry. Let me edit this answer once more! – Roméo Després May 18 '21 at 14:58
  • Just edited with a solution. It's not obvious from the data, but it also works if a row is connected to the root via any number of rows (not just one, as in my sample data). – Roméo Després May 18 '21 at 15:18
  • Romeo, so maybe graphs are the only solution here? Looking at the first part: see row 6: it is a sort of connection between row 0 and row 1, which means the unique row set should not contain rows 0 and 1 together. And your initial algo considers them both unique rows. – MiamiBeach May 18 '21 at 15:39
  • Just created a similar question in order to extend the tech stack to SQL: https://stackoverflow.com/questions/67590555/deduplication-of-dataset-which-has-several-uniqueness-attributes-in-sql-or-panda – MiamiBeach May 18 '21 at 16:46
  • You are right, the graph was only partly defined in the previous version. Actually, in my sample data almost all rows are connected except number 8! I've edited with (hopefully) correct code. – Roméo Després May 18 '21 at 17:21