
I have an interesting question for which I cannot find a built-in solution in Pandas.

I have data like:

A B    C    D    E     F
1 null null Red  null  12
1 89   76   Red  null  null
1 null null null null  12

I am trying to get an output like:

1 89 76 Red null 12

Now, these data are messages coming at different times. Many messages belong to a single transaction.

So, suppose A is the identifier of the transaction. Some messages with the same identifier have a value for column B, some have value for column C.

So, I want to group by the unique identifier and then get the non-null values for each column.

So, this is a grouping and then aggregating task. But unlike count or similar aggregations, I am really constructing a new row.

I am not finding a way to do this in pandas. Any help is greatly appreciated. Thanks!

Example:

    A   B       C       D
    1   null    null    100
    1   90      null    null
    2   null    1000    999
    1   null    80      null
    2   1000    null    799

 df.groupby('A').first()
      B     C    D
A
1  null  null  100
2  null  1000  999

But the required output is:

     B     C    D
A
1  90     80  100
2  1000  1000  999
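To make the difference above reproducible, here is a small sketch of the second example (the data and column names are taken from the question; the key assumption, confirmed later in the comments, is that `null` is the literal string `'null'`, not a real missing value):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example above; "null" is the literal string,
# not a real missing value.
df = pd.DataFrame({
    'A': [1, 1, 2, 1, 2],
    'B': ['null', 90, 'null', 'null', 1000],
    'C': ['null', 'null', 1000, 80, 'null'],
    'D': [100, 'null', 999, 'null', 799],
})

# first() only skips real NaN, so for group 1 it returns the
# string 'null' in columns B and C:
print(df.groupby('A').first())

# After converting the strings to NaN, first() picks the first
# non-null value per column within each group:
clean = df.replace('null', np.nan).groupby('A').first()
print(clean)
```

With the string placeholders converted to `NaN`, `clean` matches the required output: group 1 becomes `90, 80, 100` and group 2 becomes `1000, 1000, 999`.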
  • Something like `df.groupby('A').first()`? This might be a duplicate. – cs95 Jun 05 '18 at 06:02
  • What is the final data that you are expecting? – Sudipta Mondal Jun 05 '18 at 06:02
  • Can you confirm what it is you're looking for? – cs95 Jun 05 '18 at 06:06
  • @coldspeed I added the sample output: just the non-null values from each row with the same id. – david nadal Jun 05 '18 at 06:37
  • @jezrael It's not a duplicate... how can it be the same? The questions ask two different things. I have updated the question with a sample output for clarity. – david nadal Jun 05 '18 at 06:38
  • @davidnadal - Do you think `df.replace('null', np.nan).groupby('A').first()`? – jezrael Jun 05 '18 at 06:40
  • @jezrael No, I don't want to replace null by np.nan. It's more like a merge between the rows with the same id after the groupby. I want to extract all the non-null values for each column after grouping by the same id, and then construct a single row for each group. – david nadal Jun 05 '18 at 06:43
  • OK, is it possible to have more non-NaN values, like `12, null, 15` in the last column? – jezrael Jun 05 '18 at 06:45
  • @jezrael It's possible; then I just want to take the first one, I guess, although I am sure that scenario will be quite rare in the data I have. Anyway, I was worried about that condition, and selecting the first value there will work. – david nadal Jun 05 '18 at 06:47
  • @davidnadal - But `df.groupby('A').first()` does not work? Why? I tested it and it works nicely. – jezrael Jun 05 '18 at 06:47
  • @jezrael http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.GroupBy.first.html GroupBy.first just takes the first row of all the group rows with the same id, right? It doesn't take the first non-null values for each column, like here: https://stackoverflow.com/questions/30486417/pandas-how-do-i-select-first-row-in-each-group-by-group/30492281 – david nadal Jun 05 '18 at 06:49
  • @davidnadal - Hmmm, but it works as you need - it returns the first non-NaN. Try testing it on a small dataframe. – jezrael Jun 05 '18 at 06:50
  • @jezrael Okay, let me test on the dataset. Thanks! – david nadal Jun 05 '18 at 06:51
  • @jezrael No, it doesn't work; I tested on a CSV. – david nadal Jun 05 '18 at 08:25
  • Are the `null`s `NaN`s? – jezrael Jun 05 '18 at 08:25
  • @jezrael I have given an example too. – david nadal Jun 05 '18 at 08:27
  • So you need `df.replace('null', np.nan).groupby('A').first()` – jezrael Jun 05 '18 at 08:27
  • @jezrael It works on this example!! Could you explain as an answer maybe why this is happening? What does the np.nan replacement have to do with this? – david nadal Jun 05 '18 at 08:30
  • Because `first` gets the first non-NaN values per group. – jezrael Jun 05 '18 at 08:41
  • Nobody answered @davidnadal. Hi David, if you are still wondering how to solve your problem, here is the answer (I've tested it): `df = df.groupby('A').apply(lambda x: x.bfill().ffill())` then `df.groupby('A').last()` – Hernán Borré Apr 24 '19 at 08:10
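The two approaches from the comments can be sketched on the second example, this time assuming the nulls are real `NaN` values (the last comment's fill-based variant is adapted here to use `transform`, so the grouping column stays intact):

```python
import numpy as np
import pandas as pd

# Second example from the question, with real NaN missing values.
df = pd.DataFrame({
    'A': [1, 1, 2, 1, 2],
    'B': [np.nan, 90, np.nan, np.nan, 1000],
    'C': [np.nan, np.nan, 1000, 80, np.nan],
    'D': [100, np.nan, 999, np.nan, 799],
})

# jezrael's approach: first() already returns the first non-NaN
# value per column within each group.
via_first = df.groupby('A').first()

# Hernán Borré's approach, adapted: fill backward/forward within
# each group, then keep one row per group.
filled = df.groupby('A').transform(lambda col: col.bfill().ffill())
via_fill = filled.groupby(df['A']).last()
```

One difference worth noting: when a group has several non-null values in one column (D has both 999 and 799 for `A == 2`), `first()` keeps the first one, while the fill-then-`last()` variant keeps the last.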
