I'm processing data file with Pandas and trying to output data in a grouped form, pretty much like SQL grouping function do, a single unique record per group
I have a data file containing some stats of fake games. I am trying to group the games by their publish year and display 1 record per group with the year and the number of games for that year. So far I could count the games but the output contains all of the records counted like this:
import pandas as pd
print("*** All data count ***")
data = pd.read_csv('../input/games.csv')
print(len(data))
print("*** No duplicates ***")
no_dups = data.drop_duplicates()
print(len(no_dups))
print("*** drop unused columns: type and name for better ")
no_shit = no_dups.drop(['type', 'name', ], axis=1)
print("*** Invalid removed")
cols = ['yearpublished']
no_shit[cols] = no_shit[no_shit[cols] > 0][cols]
clean_data = no_shit.dropna()
print(len(clean_data))
print("*** Valid sorted ***")
sorted_data = clean_data.sort_values(cols)
sorted_data['title_count'] = sorted_data.groupby('yearpublished')['id'].cumcount() + 1
print(sorted_data.tail(20))
the output right now looks like the following so my code is actually grouping records but displays all of them instead just the last one.
> id yearpublished ... average_weight title_count 78848 177659 2016.0 ... 4.0000 296
> 81003 183415 2016.0 ... 0.0000 297
> 79401 179448 2016.0 ... 0.0000 298
> 81107 183684 2017.0 ... 0.0000 1
> 79706 180185 2017.0 ... 0.0000 2
> 80431 181888 2017.0 ... 0.0000 3
> 80676 182408 2017.0 ... 0.0000 4
> 62324 127709 2017.0 ... 2.6667 5
> 76115 170599 2017.0 ... 2.0000 6
> 77249 173635 2017.0 ... 0.0000 7
> 80039 181006 2017.0 ... 0.0000 8
> 65192 135986 2017.0 ... 4.0000 9
> 79263 178958 2017.0 ... 0.0000 10
> 64446 133601 2017.0 ... 0.0000 11
> 64447 133602 2017.0 ... 0.0000 12
> 81247 184151 2017.0 ... 0.0000 13
> 80677 182409 2017.0 ... 0.0000 14
> 79942 180797 2018.0 ... 0.0000 1
> 81294 184349 2018.0 ... 0.0000 2
> 80092 181140 2018.0 ... 0.0000 3
I would like to have this output instad:
id yearpublished ... average_weight title_count
79401 179448 2016.0 ... 0.0000 298
80677 182409 2017.0 ... 0.0000 14
80092 181140 2018.0 ... 0.0000 3