Many thanks for reading.
I have a pandas DataFrame of roughly 200,000 rows and 46 columns. Of these columns, 23 end in "_1" and the other 23 end in "_2". For example:
forename_1  surname_1  area_1  forename_2  surname_2  area_2
george      neil       g       jim         bob        k
jim         bob        k       george      neil       g
pete        keith      k       dan         joe        q
dan         joe        q       pete        keith      k
ben         steve      w       richard     ed         p
charlie     david      s       graham      josh       l
I have successfully removed exact duplicates using drop_duplicates, but now want to remove rows that duplicate another row with the two groups (1 and 2) swapped.
That is, for each row, I want to compare the combined values in forename_1, surname_1 and area_1 with the combined values in forename_2, surname_2 and area_2 of every other row, and vice versa.
The kind of test I am looking to use would be something like:
If "forename_1 + surname_1 + area_1 + forename_2 + surname_2 + area_2" for one row = "forename_2 + surname_2 + area_2 + forename_1 + surname_1 + area_1" for another row, then de-duplicate
I only want to keep the first row out of each set of duplicates (e.g. keep='first').
To help explain, there are two cases above where a duplicate would need to be removed:
forename_1  surname_1  area_1  forename_2  surname_2  area_2
george      neil       g       jim         bob        k
jim         bob        k       george      neil       g

forename_1  surname_1  area_1  forename_2  surname_2  area_2
pete        keith      k       dan         joe        q
dan         joe        q       pete        keith      k
george + neil + g + jim + bob + k (first row, group 1 then group 2) = george + neil + g + jim + bob + k (second row, group 2 then group 1), etc.
In each case, the second row of the two would be removed, meaning my expected output would be:
forename_1  surname_1  area_1  forename_2  surname_2  area_2
george      neil       g       jim         bob        k
pete        keith      k       dan         joe        q
ben         steve      w       richard     ed         p
charlie     david      s       graham      josh       l
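To make the test described above concrete, here is a minimal sketch of one possible approach, not a definitive solution. It treats each row as an unordered pair of groups: the "_1" values and the "_2" values are each collected into a tuple, the two tuples are sorted so that swapped rows produce the same key, and only the first row per key is kept. It assumes every "_1" column has a "_2" column with the same stem, and it casts values to strings so that they can be compared (so missing values would be compared as the text "nan"). The names cols_1, cols_2, key and deduped are purely illustrative.

```python
import pandas as pd

# Sample data copied from the example in the question.
df = pd.DataFrame(
    [
        ["george", "neil", "g", "jim", "bob", "k"],
        ["jim", "bob", "k", "george", "neil", "g"],
        ["pete", "keith", "k", "dan", "joe", "q"],
        ["dan", "joe", "q", "pete", "keith", "k"],
        ["ben", "steve", "w", "richard", "ed", "p"],
        ["charlie", "david", "s", "graham", "josh", "l"],
    ],
    columns=["forename_1", "surname_1", "area_1",
             "forename_2", "surname_2", "area_2"],
)

# Assumption: each "_1" column has a "_2" partner with the same stem.
cols_1 = [c for c in df.columns if c.endswith("_1")]
cols_2 = [c[:-2] + "_2" for c in cols_1]

# One tuple of values per group per row (cast to str so tuples can be sorted).
group_1 = df[cols_1].astype(str).itertuples(index=False, name=None)
group_2 = df[cols_2].astype(str).itertuples(index=False, name=None)

# Order-independent key: sorting the two group tuples means that
# (A, B) and (B, A) both become the same key.
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(group_1, group_2)],
    index=df.index,
)

# Keep only the first row for each key, like keep='first' in drop_duplicates.
deduped = df[~key.duplicated(keep="first")]
print(deduped)
```

On the sample data this keeps the rows with index 0, 2, 4 and 5, which matches the expected output above. Building the key with a Python-level loop is simple but may not be the fastest option for 200,000 rows and 23 column pairs.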
I have seen an answer that deals with this in R ("Compare group of two columns and return index matches R"), but is there also a way to do this in Python?
Many thanks.