I have a pandas DataFrame as below:

id  age   gender  country  sales_year
1   None   M       India    2016
2   23     F       India    2016
1   20     M       India    2015
2   25     F       India    2015
3   30     M       India    2019
4   36     None    India    2019
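
For reproducibility, the frame can be built like this (assuming the None entries are real nulls, not the string 'None'):

import pandas as pd

df = pd.DataFrame({
    'id':         [1, 2, 1, 2, 3, 4],
    'age':        [None, 23, 20, 25, 30, 36],
    'gender':     ['M', 'F', 'M', 'F', 'M', None],
    'country':    ['India'] * 6,
    'sales_year': [2016, 2016, 2015, 2015, 2019, 2019],
})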

I want to group by id and take the latest row as per sales_year, backfilling each column with the most recent non-null value.

Expected output:

id  age   gender  country  sales_year
1   20     M       India    2016
2   23     F       India    2016
3   30     M       India    2019
4   36     None    India    2019

In PySpark:

from pyspark.sql import Window, functions as f

df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))

But I need the same solution in pandas.

EDIT: This can be the case with any of the columns, not just age. I need it to pick up the latest non-null data (where the id exists) for all the ids.

  • your output still contains `None` values, unless I am missing something – quant Nov 26 '19 at 10:08
  • if none of the rows has any valid data then **None** is fine. But if a value is available, it should be picked up the way it is in the example for id 1, where age got replaced with the data from the second-highest year. – j ' Nov 26 '19 at 10:10

3 Answers

Use GroupBy.first, which returns the first non-null value of each column within every group:

df1 = df.groupby('id', as_index=False).first()
print (df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019

If the sales_year column is not sorted, sort it descending first:

df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
jezrael
  • How is this sorting the data as per **sales_year**? I need the latest year's row first. – j ' Nov 26 '19 at 10:22
  • is there any way I can do this for all the columns? I have 20+ columns like this. I wish to do it in one go for all of them, as the sort stays on the same 'sales_year' column and the group by stays on the same 'id' column. – j ' Nov 26 '19 at 10:28
  • @j' - It seems you want the original solution with the sorting added first? `first` returns the first non-`None`/non-`NaN` value per column, which is the reason `2016` appears in the first row. – jezrael Nov 26 '19 at 10:30
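
If, as in the question's PySpark snippet, you want the non-null values backfilled onto every row for all 20+ columns in one go, a transform-based sketch (assuming the df built in the question):

# Pandas analogue of the PySpark window: broadcast the first non-null
# value per id onto every row, for all data columns at once.
df_sorted = df.sort_values('sales_year', ascending=False)
cols = df.columns.difference(['id'])
df_sorted[cols] = df_sorted.groupby('id')[cols].transform('first')
df_latest = df_sorted.drop_duplicates('id')  # keep one (latest) row per id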
import numpy as np

print(df.replace('None', np.nan).groupby('id').first())
  • first, replace the string 'None' with NaN
  • next, use groupby() to group by 'id'
  • finally, take the first non-null values per group using first() (a sorted variant is sketched below)
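
A sorted variant, in case the row order is not guaranteed (a sketch, assuming sales_year should decide which row wins, as discussed under the question):

import numpy as np

out = (df.replace('None', np.nan)
         .sort_values('sales_year', ascending=False)
         .groupby('id')
         .first())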
davidbilla

Use -

df.dropna(subset=['gender']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()

Output

id
1    20
2    23
3    30
4    36
Name: age, dtype: object

Remove the ['age'] to get full rows (note this variant also uses a bare dropna() across all columns) -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()

Output

   age gender country  sales_year
id                               
1   20      M   India        2015
2   23      F   India        2016
3   30      M   India        2019
4   36   None   India        2019

You can put the id back as a column with reset_index() -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()

Output

   id age gender country  sales_year
0   1  20      M   India        2015
1   2  23      F   India        2016
2   3  30      M   India        2019
3   4  36   None   India        2019
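
Note that the bare dropna() also drops id 1's 2016 row (its age is null), which is why sales_year reads 2015 here rather than the 2016 in the question's expected output. Since first() already skips nulls column by column, a sketch that simply omits the dropna():

df.sort_values('sales_year', ascending=False).groupby('id').first().reset_index()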
Vivek Kalyanarangan