I have a dataset with some repeated data that I need to remove. For some reason, the data I need is always in the middle:
--> df_apps
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 1000 | 5000
2021-01-10 | FACEBOOK | 20000 | 900000
2021-02-10 | FACEBOOK | 9000 | 72000
2021-01-11 | FACEBOOK | 4000 | 2000
2021-01-11 | FACEBOOK | 40000 | 85000
2021-02-11 | FACEBOOK | 1000 | 2000
In pandas, it'd be as simple as df_apps_grouped = df_apps.groupby('DATE').nth(1)
and I'd get the result below:
--> df_apps_grouped
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 20000 | 900000
2021-01-11 | FACEBOOK | 40000 | 85000
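For reference, here is a minimal sketch of the pandas version that reproduces the result above. It assumes the sample rows are typed in by hand with the dates stored as plain strings; the as_index=False argument is my own addition, so that DATE stays a regular column across pandas versions.

```python
import pandas as pd

# Sample rows copied from the df_apps table above (dates kept as strings).
df_apps = pd.DataFrame(
    {
        "DATE": ["2021-01-10", "2021-01-10", "2021-02-10",
                 "2021-01-11", "2021-01-11", "2021-02-11"],
        "APP": ["FACEBOOK"] * 6,
        "DOWNLOADS": [1000, 20000, 9000, 4000, 40000, 1000],
        "ACTIVE_USERS": [5000, 900000, 72000, 2000, 85000, 2000],
    }
)

# nth(1) keeps the row at 0-based position 1 (the second row) of each DATE group.
# Groups with only one row contribute nothing to the result.
# as_index=False is an assumption added here so DATE remains a column.
df_apps_grouped = df_apps.groupby("DATE", as_index=False).nth(1)
print(df_apps_grouped)
```

This is the behaviour I'm trying to get in PySpark: one row per DATE, picked by its position inside the group.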
But for one specific project, I must use PySpark, and I can't reproduce this result there. Could you please help me with this?
Thanks!