I have a pandas DataFrame as below:

id  age   gender  country  sales_year
1   None   M       India    2016
2   23     F       India    2016
1   20     M       India    2015
2   25     F       India    2015
3   30     M       India    2019
4   36     None    India    2019
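
For reproducibility, the frame can be built like this (assuming the None entries are real nulls, not the string 'None'):

import pandas as pd

df = pd.DataFrame({
    'id':         [1, 2, 1, 2, 3, 4],
    'age':        [None, 23, 20, 25, 30, 36],
    'gender':     ['M', 'F', 'M', 'F', 'M', None],
    'country':    ['India'] * 6,
    'sales_year': [2016, 2016, 2015, 2015, 2019, 2019],
})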

I want to group by id and take the latest row as per sales_year, backfilling each column with the most recent non-null value.

Expected output:

id  age   gender  country  sales_year
1   20     M       India    2016
2   23     F       India    2016
3   30     M       India    2019
4   36     None    India    2019

In PySpark:

from pyspark.sql import Window, functions as f

df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))

But I need the same solution in pandas.

EDIT: This can be the case with any of the columns, not just age. I need it to pick up the latest non-null data (where the id exists) for all the ids.

  • your output still contains `None` values, unless I am missing something – quant Nov 26 '19 at 10:08
  • if none of the rows has any valid data then **None** is fine. But if a value is available, it should be picked up the way it is in the example for id 1, where age got replaced with the data from the second-highest year. – j ' Nov 26 '19 at 10:10

3 Answers

Use GroupBy.first, which returns the first non-null value of each column within every group:

df1 = df.groupby('id', as_index=False).first()
print (df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019

If the sales_year column is not sorted, sort it descending first:

df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019
jezrael
  • How is this sorting the data as per **sales_year**? I need the latest year's row first. – j ' Nov 26 '19 at 10:22
  • is there any way I can do this for all the columns? I have 20+ columns like this. I wish to do it in one go for all of them, as the sort stays on the same 'sales_year' column and the group by stays on the same 'id' column. – j ' Nov 26 '19 at 10:28
  • @j' - It seems you want the original solution with the sorting added first? `first` returns the first non-`None`/non-`NaN` value per column, which is the reason `2016` appears in the first row. – jezrael Nov 26 '19 at 10:30
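
If, as in the question's PySpark snippet, you want the non-null values backfilled onto every row for all 20+ columns in one go, a transform-based sketch (assuming the df built in the question):

# Pandas analogue of the PySpark window: broadcast the first non-null
# value per id onto every row, for all data columns at once.
df_sorted = df.sort_values('sales_year', ascending=False)
cols = df.columns.difference(['id'])
df_sorted[cols] = df_sorted.groupby('id')[cols].transform('first')
df_latest = df_sorted.drop_duplicates('id')  # keep one (latest) row per id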
import numpy as np

print(df.replace('None', np.nan).groupby('id').first())
  • first, replace the string 'None' with NaN
  • next, use groupby() to group by 'id'
  • finally, take the first non-null values per group using first() (a sorted variant is sketched below)
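
A sorted variant, in case the row order is not guaranteed (a sketch, assuming sales_year should decide which row wins, as discussed under the question):

import numpy as np

out = (df.replace('None', np.nan)
         .sort_values('sales_year', ascending=False)
         .groupby('id')
         .first())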
davidbilla

Use -

df.dropna(subset=['gender']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()

Output

id
1    20
2    23
3    30
4    36
Name: age, dtype: object

Remove the ['age'] to get full rows (note this variant also uses a bare dropna() across all columns) -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()

Output

   age gender country  sales_year
id                               
1   20      M   India        2015
2   23      F   India        2016
3   30      M   India        2019
4   36   None   India        2019

You can put the id back as a column with reset_index() -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()

Output

   id age gender country  sales_year
0   1  20      M   India        2015
1   2  23      F   India        2016
2   3  30      M   India        2019
3   4  36   None   India        2019
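
Note that the bare dropna() also drops id 1's 2016 row (its age is null), which is why sales_year reads 2015 here rather than the 2016 in the question's expected output. Since first() already skips nulls column by column, a sketch that simply omits the dropna():

df.sort_values('sales_year', ascending=False).groupby('id').first().reset_index()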
Vivek Kalyanarangan