1

I have a dataset of companies which looks like this:

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
3      D      3       x
4      E      4       y
5      A      5       x
6      B      0       t
7      C      0       z
8      F      6       u
9      E      3       v

I need to deduplicate it using the following rules:

  1. 2 companies are considered to be the same if ANY of 3 attributes match (see row 0 and row 5)
  2. 2 companies are considered to be the same if ALL their attributes differ, but they have matches via a chain of other companies which are considered to be the same by condition 1 (the simpliest example: rows 0 and 1 are "connected" via row 6)

There is one additional task: I want to mark each duplicate row with id of its unique "representative".

The resulting collection is here:

  tax_id  phone address  id  representative_index
0      A      0       x   0                     0
1      B      1       y   1                     1
2      C      2       z   2                     2
3      D      3       x   0                     0
4      E      4       y   1                     1
5      A      5       x   0                     0
6      B      0       t   0                     0
7      C      0       z   0                     0
8      F      6       u   8                     8
9      E      3       v   0                     0

Is there any way to deduplicate this collection?

I would prefer to use SQL, but if it is not possible - Pandas/Python would work. I have created a similair question for Pandas (Pandas and python: deduplication of dataset by several fields), but I want to expand the tech stack to SQL.

MiamiBeach
  • 3,261
  • 6
  • 28
  • 54

0 Answers0