Get counts by group using pandas

Question

I have a pandas dataframe which contains data as shown below:

ID  year_month_id   Class
1   201612          A
2   201612          D
3   201612          B
4   201612          Other
5   201612          Other
6   201612          Other
7   201612          A
8   201612          Other
9   201612          A
1   201701          B

So an ID can be under any class in a particular month and next month his class might change. Now what I want to do is for each ID get the number of months it has been under a particular class and also the latest class under which it falls. Something like below:

ID  Class_A Class_B Class_D Other Latest_Class
1   2        3       4         0    B
2   12       0       0         0    D

How do I achieve this in python. Can someone please help me with this? Also , since the real dataset is huge and manually verifying is not possible, how can I get a list of ID's which fall under more than 1 class?

Bharath M Shetty · Accepted Answer · 2017-12-20T12:52:53.360

4

We can use pivot table and concat i.e

ndf = df.pivot_table(index=['ID'],columns=['Class'],aggfunc='count',fill_value=0)\
    .xs('year_month_id', axis=1, drop_level=True)

ndf['latest'] = df.sort_values('ID').groupby('ID')['Class'].tail(1).values

Class  A  B  D  Other latest
ID                          
1      1  1  0      0      B
2      0  0  1      0      D
3      0  1  0      0      B
4      0  0  0      1  Other
5      0  0  0      1  Other
6      0  0  0      1  Other
7      1  0  0      0      A
8      0  0  0      1  Other
9      1  0  0      0      A

edited Dec 20 '17 at 12:52

answered Dec 20 '17 at 12:47

Bharath M Shetty

30,075
6
57
108

1

Using `pivot` is a good option here, should be the fastest I'm guessing. – cs95 Dec 20 '17 at 12:48
1

When one pivots just 2 columns and uses `count` as aggfunc, filling with zeroes (exactly this case) it is worth considering using `pd.crosstab`. – jo9k Dec 20 '17 at 13:35
thanks a lot @Dark. Since the data is huge and I can't manually go and check if the output is correct for each ID, how can i get a list of ID's which have entries as 1 in more than 1 column. – Shuvayan Das Dec 20 '17 at 13:56

jezrael · Answer 2 · 2017-12-20T12:51:47.553

You can get counts by groupby with aggregate count, reshape by unstack. Last add new column with drop_duplicates:

df1 = df.groupby(['ID','Class'])['year_month_id'].count().unstack(fill_value=0)
df1['Latest_Class'] = df.drop_duplicates('ID', keep='last').set_index('ID')['Class']
print (df1)
Class  A  B  D  Other Latest_Class
ID                                
1      1  1  0      0            B
2      0  0  1      0            D
3      0  1  0      0            B
4      0  0  0      1        Other
5      0  0  0      1        Other
6      0  0  0      1        Other
7      1  0  0      0            A
8      0  0  0      1        Other
9      1  0  0      0            A

Downvoter, if there's something wrong with my answer, please let me know, so I can correct it. Thanks. — jezrael, Dec 20 '17 at 13:09

cs95 · Answer 3 · 2017-12-21T16:10:16.393

3

You can get a count of classes attended with groupby + value_counts + unstack -

g = df.groupby('ID')
i = g.Class.value_counts().unstack(fill_value=0)

To get the last Class, use groupby + last -

j = g.Class.last()

Concatenate to get your result -

pd.concat([i, j], 1).rename(columns={'Class': 'LastClass'})

    A  B  D  Other LastClass
ID                          
1   1  1  0      0         B
2   0  0  1      0         D
3   0  1  0      0         B
4   0  0  0      1     Other
5   0  0  0      1     Other
6   0  0  0      1     Other
7   1  0  0      0         A
8   0  0  0      1     Other
9   1  0  0      0         A

To get a list of IDs which have more than 1 per row, use sum + a mask -

k = i.sum(axis=1)
k[k > 1]

ID
1    2
dtype: int64

edited Dec 21 '17 at 16:10

answered Dec 20 '17 at 12:46

cs95

379,657
97
704
746

1

Downvoter, if there's something wrong with answer, please let me know, so I can correct it. Thanks. – jezrael Dec 20 '17 at 13:09
2

@jezrael Someone mistook christmas for april fool's day. – cs95 Dec 20 '17 at 13:11

jo9k · Answer 4 · 2017-12-20T13:52:17.117

When one pivots just 2 columns and uses count as aggfunc, filling missing entries with zeroes (exactly this case) it is worth considering using pd.crosstab:

 >> new_df = pd.crosstab(df.ID, df.Class)
 >> new_df
Class  A  B  D  Other
ID
1      1  1  0      0
2      0  0  1      0
3      0  1  0      0
4      0  0  0      1
5      0  0  0      1
6      0  0  0      1
7      1  0  0      0
8      0  0  0      1
9      1  0  0      0

You get last value of class from initial dataframe with grouping by ID and choosing last entry:

>> df.groupby('ID').Class.last()
ID
1        B
2        D
3        B
4    Other
5    Other
6    Other
7        A
8    Other
9        A

Then you can put them together with concatenation:

>> new_df = pd.concat([new_df, df.groupby('ID').Class.last()], 1)
    A  B  D  Other  Class
ID
1   1  1  0      0      B
2   0  0  1      0      D
3   0  1  0      0      B
4   0  0  0      1  Other
5   0  0  0      1  Other
6   0  0  0      1  Other
7   1  0  0      0      A
8   0  0  0      1  Other
9   1  0  0      0      A

And to get output exactly as you wanted it:

>> new_df = new_df.rename(columns={'Class':'LastClass'})
    A  B  D  Other LastClass
ID
1   1  1  0      0         B
2   0  0  1      0         D
3   0  1  0      0         B
4   0  0  0      1     Other
5   0  0  0      1     Other
6   0  0  0      1     Other
7   1  0  0      0         A
8   0  0  0      1     Other
9   1  0  0      0         A

Putting all together as oneliner:

>> new_df = pd.concat([pd.crosstab(df.ID, df.Class),df.groupby('ID').Class.last()],1).rename(columns={'Class':'LastClass'})

>> new_df
    A  B  D  Other LastClass
ID
1   1  1  0      0         B
2   0  0  1      0         D
3   0  1  0      0         B
4   0  0  0      1     Other
5   0  0  0      1     Other
6   0  0  0      1     Other
7   1  0  0      0         A
8   0  0  0      1     Other
9   1  0  0      0         A

Get counts by group using pandas

4 Answers4