I have a dataset of companies which looks like this:
tax_id phone address
0 A 0 x
1 B 1 y
2 C 2 z
3 D 3 x
4 E 4 y
5 A 5 x
6 B 0 t
7 C 0 z
8 F 6 u
9 E 3 v
I need to deduplicate it using the following rules:
- 2 companies are considered to be the same if ANY of 3 attributes match (see row 0 and row 5)
- 2 companies are considered to be the same if ALL their attributes differ, but they have matches via a chain of other companies which are considered to be the same by condition 1 (the simpliest example: rows 0 and 1 are "connected" via row 6)
There is one additional task: I want to mark each duplicate row with id of its unique "representative".
The resulting collection is here:
tax_id phone address id representative_index
0 A 0 x 0 0
1 B 1 y 1 1
2 C 2 z 2 2
3 D 3 x 0 0
4 E 4 y 1 1
5 A 5 x 0 0
6 B 0 t 0 0
7 C 0 z 0 0
8 F 6 u 8 8
9 E 3 v 0 0
Is there any way to deduplicate this collection?
I would prefer to use SQL, but if it is not possible - Pandas/Python would work. I have created a similair question for Pandas (Pandas and python: deduplication of dataset by several fields), but I want to expand the tech stack to SQL.