pandas df code is
Data = data[data['ObservationDate'] == max(data['ObservationDate'])].reset_index()
Data_world = Data.groupby(["ObservationDate"])["Confirmed","Active_case","Recovered","Deaths"].sum().reset_index()
Data_world
Dataframe structure is this.
SNo ObservationDate Province/State Country/Region Last Update Confirmed Deaths Recovered Active_case
0 1 01/22/2020 Anhui China 1/22/2020 17:00 1 0 0 1
1 2 01/22/2020 Beijing China 1/22/2020 17:00 14 0 0 14
2 3 01/22/2020 Chongqing China 1/22/2020 17:00 6 0 0 6
3 4 01/22/2020 Fujian China 1/22/2020 17:00 1 0 0 1
4 5 01/22/2020 Gansu China 1/22/2020 17:00 0 0 0 0
and want output like this
ObservationDate Confirmed Active_case Recovered Deaths
0 03/22/2020 335957 223441 97882 14634
How to filter on max date?
max_date = df.select(max("ObservationDate")).first()
group_data = df.groupBy("ObservationDate")
group_data.agg({'Confirmed':'sum', 'Deaths':'sum', 'Recovered':'sum', 'Active_case':'sum'}).show()