I'm trying to clean a Python Pandas dataframe
that contains dirty data with "repeated" (but not exactly duplicated) people information.
id name name2 name3 email
1 A A A email@gmail.com
1 A NaN NaN NaN
NaN A A B email@gmail.com
NaN A A B email@gmail.com
1 A A B NaN
NaN A A A email@gmail.com
Unfortunately I don't have a clear "primary key" since the column id
is not always set and I have a list of different names (name
,name2
,name3
) that don't match always (sometime I have the same name
but different name2
).
I'd like to keep both these information, but removing duplicate rows and "merging" rows in order to remove the maximum number of NaN values, without loosing any king of information.
The output should be that:
id name name2 name3 email
1 A A A email@gmail.com
1 A A B email@gmail.com
The second row is given by the merge between
NaN A A B email@gmail.com
1 A A B NaN
in the original dataframe.
(I already tried the solution here: How can I merge duplicate rows and fill the NaN cells with the values from the other row? but without success)
Thanks.