1

I'm trying to clean a Python Pandas dataframe that contains dirty data with "repeated" (but not exactly duplicated) people information.

id  name    name2   name3   email
1   A       A       A       email@gmail.com
1   A       NaN     NaN     NaN
NaN A       A       B       email@gmail.com
NaN A       A       B       email@gmail.com
1   A       A       B       NaN
NaN A       A       A       email@gmail.com

Unfortunately I don't have a clear "primary key" since the column id is not always set and I have a list of different names (name,name2,name3) that don't match always (sometime I have the same name but different name2). I'd like to keep both these information, but removing duplicate rows and "merging" rows in order to remove the maximum number of NaN values, without loosing any king of information.

The output should be that:

id  name    name2   name3   email
1   A       A       A       email@gmail.com
1   A       A       B       email@gmail.com

The second row is given by the merge between

NaN A       A       B       email@gmail.com
1   A       A       B       NaN

in the original dataframe.

(I already tried the solution here: How can I merge duplicate rows and fill the NaN cells with the values from the other row? but without success)

Thanks.

Paolo Magnani
  • 549
  • 4
  • 14
  • This is more of a clustering problem than simply filling missing values, more like this question really: https://stackoverflow.com/q/43927449/4800086 – Swier Sep 08 '22 at 11:47
  • 1
    @Swier I was also wondering, but in this case OP should provide a much better example (more ids, more names, etc.) – mozway Sep 08 '22 at 11:50

1 Answers1

0

Maybe the example is unclear, but IIUC, ffill and drop_duplicates:

out = df.ffill().drop_duplicates()

output:

    id name name2 name3            email
0  1.0    A     A     A  email@gmail.com
2  1.0    A     A     B  email@gmail.com
mozway
  • 194,879
  • 13
  • 39
  • 75