I have a dataset of companies. Each company has a tax payer number, an address, a phone number, and some other fields. Here is sample Pandas code I took from Roméo Després:
import pandas as pd
df = pd.DataFrame({
"tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
"phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
"address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})
print(df)
  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
3      D      3       x
4      E      4       y
5      A      5       x
6      B      0       t
7      C      0       z
8      F      6       u
9      E      3       v
I need to deduplicate the dataset by these fields, where duplicate companies may be linked by just one of the fields. That is, a company is definitely unique in my list only if it has no match with any other row on any of the key fields. If a company shares its tax payer number with some other entity, and that entity shares an address with a third one, then all three records are the same company. The expected output in terms of unique companies should be:
  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
8      F      6       u
The expected output, with the index of the unique (representative) company for each row, should look like:
  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     1
2      C      2       z                     2
3      D      3       x                     0
4      E      4       y                     1
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     3
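To spell the linking rule out on the example data: rows 0 and 5 share tax_id "A", and rows 0 and 3 share address "x", so rows 0, 3 and 5 all collapse into company 0. Below is a minimal sketch of that rule using a plain union-find over the key columns. It only illustrates how rows get linked (the frame construction is repeated so the snippet runs on its own); it does not produce the deduplicated frame or the representative indices in the format I need.

import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

# One union-find node per row.
parent = list(range(len(df)))

def find(i):
    # Follow parent pointers to the root, compressing the path as we go.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Link all rows that share a value in any of the key columns.
for col in ["tax_id", "phone", "address"]:
    for positions in df.groupby(col).indices.values():
        for other in positions[1:]:
            union(positions[0], other)

# Label each row with the root of its group (an arbitrary member of the
# group, not necessarily the representative index shown above).
df["component"] = [find(i) for i in range(len(df))]
print(df)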
How can I filter out duplicates in this case using python/pandas?
The only algorithm that comes to my mind is the following direct approach:
- I group the dataset by the first key and collect the other keys as sets in the resulting dataset.
- Then I iteratively walk over the 2nd-key sets and, for a given value of the 1st key, add any new 2nd-key values to its group, repeating until nothing changes.
- Finally, when there is nothing more to add, I repeat the same for the 3rd key (a rough sketch of this merging follows below).
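Roughly, this is what I mean. It is a sketch only: instead of the key-by-key passes described above, it just keeps merging any two groups that share a value in any key column until a full pass changes nothing, reusing the df constructed at the top of the post.

key_cols = ["tax_id", "phone", "address"]

# Start with one group per row, holding that row's key values as sets.
groups = []
for i in df.index:
    g = {col: {df.at[i, col]} for col in key_cols}
    g["rows"] = {i}
    groups.append(g)

# Keep merging groups that share a value in any key column
# until a full pass makes no changes.
changed = True
while changed:
    changed = False
    remaining = []
    while groups:
        g = groups.pop()
        for other in groups:
            if any(g[col] & other[col] for col in key_cols):
                # Fold g into the group it overlaps with.
                for col in key_cols:
                    other[col] |= g[col]
                other["rows"] |= g["rows"]
                changed = True
                break
        else:
            remaining.append(g)
    groups = remaining

# Each surviving group is one "company"; print its row indices.
for g in groups:
    print(sorted(g["rows"]))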
This doesn't look very promising in terms of performance or simplicity of coding.
Are there other ways to remove duplicates that are linked by any one of several keys?