
I have a dataset that contains 2 copies of each record. Each record has an ID, and both copies of a record share the same ID.

15 of the 18 fields are identical in both copies of a record. In the other 3 fields the copies are complementary: the top row contains 2 values and 1 NaN, and the bottom row contains a value where the top row had the NaN and NaNs where the top row had values. Occasionally there are stray NaNs that don't follow this pattern.

I need to collapse each pair of rows into a single record that contains all 3 non-NaN values.
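Roughly this shape, with hypothetical column names standing in for the real 18 fields:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two copies per ID, 'value' identical in both
# copies, and the string fields carrying complementary NaNs.
df = pd.DataFrame({
    'ID':      [1, 1, 2, 2],
    'value':   [10, 10, 20, 20],
    'field_a': ['x', np.nan, np.nan, 'p'],
    'field_b': ['y', np.nan, 'q', np.nan],
    'field_c': [np.nan, 'z', 'r', np.nan],
})
```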

I have tried various versions of groupby, but that omits the 3 fields I need (all of them string-based) and doubles the values of certain numeric fields.
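That matches what a plain numeric sum does on the sample above: the string columns drop out, and the numeric field that was identical in both copies is doubled:

```python
# String columns are excluded and identical numeric values are added twice.
summed = df.groupby('ID', as_index=False).sum(numeric_only=True)
#    ID  value          <- field_a / field_b / field_c are gone
# 0   1     20          <- 10 + 10
# 1   2     40          <- 20 + 20
```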

If all else fails, I'll turn the string fields into number codes and use `df.groupby(['ID']).agg('sum')`; a sketch of that fallback follows.
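A sketch of that fallback for one field, using the hypothetical sample above. `.map()` leaves NaN as NaN, so the NaN-skipping sum recovers the single surviving code per ID:

```python
# Encode one string field as numeric codes (NaN stays NaN), sum within
# each ID (sum skips NaN, so the lone code survives), then decode.
mapping = {v: i + 1 for i, v in enumerate(df['field_a'].dropna().unique())}
inverse = {code: v for v, code in mapping.items()}

codes = df['field_a'].map(mapping)              # [1, NaN, NaN, 2]
decoded = codes.groupby(df['ID']).sum().map(inverse)
# ID 1 -> 'x', ID 2 -> 'p'
```

Note that the sum would still double the numeric fields that are identical in both copies, so those would need separate handling.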

But I figure there's probably a smarter way to do this.

kmcoding
  • Welcome to Stack Overflow! Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a [mcve] including sample input and sample output, so we can better understand your question – G. Anderson Jan 28 '20 at 19:53
  • The aggregation you want is `first`: `df.groupby('ID').agg('first')`. Also, summing strings isn't good: [Python sum, why not strings?](https://stackoverflow.com/questions/3525359/python-sum-why-not-strings) (see the sketch after these comments) – ALollz Jan 28 '20 at 20:06
  • Can you add a sample of the dataset? And have you tried a for-loop to fill the NaN with the non-NaN string of the same ID? – Renate van Kempen Jan 28 '20 at 20:32
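For the record, a sketch of the `first` aggregation ALollz suggests, run against the hypothetical sample above. `groupby.first()` keeps the first non-NaN value per column within each group, so the complementary NaNs collapse and nothing is summed or doubled:

```python
out = df.groupby('ID', as_index=False).first()
#    ID  value field_a field_b field_c
# 0   1     10       x       y       z
# 1   2     20       p       q       r
```

A field that is NaN in both copies simply stays NaN, so the stray NaNs that don't follow the pattern are preserved rather than invented.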

0 Answers