0

I'm processing data file with Pandas and trying to output data in a grouped form, pretty much like SQL grouping function do, a single unique record per group

I have a data file containing some stats of fake games. I am trying to group the games by their publish year and display 1 record per group with the year and the number of games for that year. So far I could count the games but the output contains all of the records counted like this:

import pandas as pd

print("*** All data count ***")
data = pd.read_csv('../input/games.csv')
print(len(data))

print("*** No duplicates ***")
no_dups = data.drop_duplicates()
print(len(no_dups))

print("*** drop unused columns: type and name for better ")
no_shit = no_dups.drop(['type', 'name', ], axis=1)

print("*** Invalid removed")
cols = ['yearpublished']
no_shit[cols] = no_shit[no_shit[cols] > 0][cols]
clean_data = no_shit.dropna()
print(len(clean_data))

print("*** Valid sorted ***")
sorted_data = clean_data.sort_values(cols)
sorted_data['title_count'] = sorted_data.groupby('yearpublished')['id'].cumcount() + 1

print(sorted_data.tail(20))

the output right now looks like the following so my code is actually grouping records but displays all of them instead just the last one.

>            id  yearpublished     ...       average_weight  title_count 78848  177659         2016.0     ...               4.0000          296
> 81003  183415         2016.0     ...               0.0000          297
> 79401  179448         2016.0     ...               0.0000          298
> 81107  183684         2017.0     ...               0.0000            1
> 79706  180185         2017.0     ...               0.0000            2
> 80431  181888         2017.0     ...               0.0000            3
> 80676  182408         2017.0     ...               0.0000            4
> 62324  127709         2017.0     ...               2.6667            5
> 76115  170599         2017.0     ...               2.0000            6
> 77249  173635         2017.0     ...               0.0000            7
> 80039  181006         2017.0     ...               0.0000            8
> 65192  135986         2017.0     ...               4.0000            9
> 79263  178958         2017.0     ...               0.0000           10
> 64446  133601         2017.0     ...               0.0000           11
> 64447  133602         2017.0     ...               0.0000           12
> 81247  184151         2017.0     ...               0.0000           13
> 80677  182409         2017.0     ...               0.0000           14
> 79942  180797         2018.0     ...               0.0000            1
> 81294  184349         2018.0     ...               0.0000            2
> 80092  181140         2018.0     ...               0.0000            3

I would like to have this output instad:

 id  yearpublished     ...       average_weight  title_count
79401  179448         2016.0     ...               0.0000          298
80677  182409         2017.0     ...               0.0000           14
80092  181140         2018.0     ...               0.0000            3
smoczyna
  • 489
  • 6
  • 18
  • What is expected output from sample data? Is possible create [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) ? – jezrael Jan 29 '19 at 08:11
  • 1
    So need `df = df.drop_duplicates('yearpublished', keep='last')` ? – jezrael Jan 29 '19 at 08:21
  • that fills the bill indeed, thanks. You should actually answer my question instead so I could mark your answer as the right one. Also my post is not exactly a duplicate of the one you linked at the beginning. – smoczyna Jan 29 '19 at 14:32
  • Yes, for me it should be nice too, but if me not then some another user close your question :( – jezrael Jan 29 '19 at 14:34

0 Answers0