
I have a dataset that contains 2 copies of each record. Each record has an ID, and both copies of a record share the same ID.

15 of the 18 fields are identical in both copies of a record. In the other 3 fields the copies are complementary: the top row contains 2 values and 1 NaN, and the bottom row contains a value where the top row had the NaN and NaNs where the top row had values. Occasionally there are stray NaNs that don't follow this pattern.

I need to collapse each pair of rows into a single record that contains all 3 non-NaN values.
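Roughly this shape, with hypothetical column names standing in for the real 18 fields:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two copies per ID, 'value' identical in both
# copies, and the string fields carrying complementary NaNs.
df = pd.DataFrame({
    'ID':      [1, 1, 2, 2],
    'value':   [10, 10, 20, 20],
    'field_a': ['x', np.nan, np.nan, 'p'],
    'field_b': ['y', np.nan, 'q', np.nan],
    'field_c': [np.nan, 'z', 'r', np.nan],
})
```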

I have tried various versions of groupby, but that omits the 3 fields I need (all of them string-based) and doubles the values of certain numeric fields.
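That matches what a plain numeric sum does on the sample above: the string columns drop out, and the numeric field that was identical in both copies is doubled:

```python
# String columns are excluded and identical numeric values are added twice.
summed = df.groupby('ID', as_index=False).sum(numeric_only=True)
#    ID  value          <- field_a / field_b / field_c are gone
# 0   1     20          <- 10 + 10
# 1   2     40          <- 20 + 20
```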

If all else fails, I'll turn the string fields into number codes and use `df.groupby(['ID']).agg('sum')`; a sketch of that fallback follows.
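A sketch of that fallback for one field, using the hypothetical sample above. `.map()` leaves NaN as NaN, so the NaN-skipping sum recovers the single surviving code per ID:

```python
# Encode one string field as numeric codes (NaN stays NaN), sum within
# each ID (sum skips NaN, so the lone code survives), then decode.
mapping = {v: i + 1 for i, v in enumerate(df['field_a'].dropna().unique())}
inverse = {code: v for v, code in mapping.items()}

codes = df['field_a'].map(mapping)              # [1, NaN, NaN, 2]
decoded = codes.groupby(df['ID']).sum().map(inverse)
# ID 1 -> 'x', ID 2 -> 'p'
```

Note that the sum would still double the numeric fields that are identical in both copies, so those would need separate handling.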

But I figure there's probably a smarter way to do this.

kmcoding
  • Welcome to Stack Overflow! Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a [mcve] including sample input and sample output, so we can better understand your question – G. Anderson Jan 28 '20 at 19:53
  • The aggregation you want is `first`: `df.groupby('ID').agg('first')`. Also, summing strings isn't good: [Python sum, why not strings?](https://stackoverflow.com/questions/3525359/python-sum-why-not-strings) (see the sketch after these comments) – ALollz Jan 28 '20 at 20:06
  • Can you add a sample of the dataset? And have you tried a for-loop to fill the NaN with the non-NaN string of the same ID? – Renate van Kempen Jan 28 '20 at 20:32
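For the record, a sketch of the `first` aggregation ALollz suggests, run against the hypothetical sample above. `groupby.first()` keeps the first non-NaN value per column within each group, so the complementary NaNs collapse and nothing is summed or doubled:

```python
out = df.groupby('ID', as_index=False).first()
#    ID  value field_a field_b field_c
# 0   1     10       x       y       z
# 1   2     20       p       q       r
```

A field that is NaN in both copies simply stays NaN, so the stray NaNs that don't follow the pattern are preserved rather than invented.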

0 Answers