I have a dataframe with two variables, ID and outcome. I'm trying to group by ID first, then count the number of occurrences of each unique value of outcome within that ID.

df
ID    outcome
1      yes
1      yes
1      yes
2      no
2      yes
2      no

Expected output:

ID    yes    no
1      3     0
2      1     2

My code df.groupby('ID')['outcome'].nunique() gives the number of unique values itself, such that:

ID
1   2
2   2

But I need the counts of yes and no. How can I achieve that? Thanks!

cs95
Lumos

5 Answers


How about pd.crosstab?

In [1217]: pd.crosstab(df.ID, df.outcome)
Out[1217]: 
outcome  no  yes
ID              
1         0    3
2         2    1
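
If you want the columns in the order of the expected output (yes before no), you can reorder the crosstab result with a plain column selection. A minimal runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'outcome': ['yes', 'yes', 'yes', 'no', 'yes', 'no']})

# Cross-tabulate IDs against outcomes, then reorder columns to yes/no
result = pd.crosstab(df.ID, df.outcome)[['yes', 'no']]
print(result)
```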
cs95

Option 2
pd.factorize + np.bincount
This is convoluted and painful... but very fast.

fi, ui = pd.factorize(df.ID.values)
fo, uo = pd.factorize(df.outcome.values)

n, m = ui.size, uo.size
pd.DataFrame(
    np.bincount(fi * m + fo, minlength=n * m).reshape(n, m),
    pd.Index(ui, name='ID'), pd.Index(uo, name='outcome')
)

outcome  yes  no
ID              
1          3   0
2          1   2
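The trick is that fi * m + fo maps each (ID, outcome) pair to a unique cell of an n × m grid, so a single np.bincount pass counts every pair at once. The same approach with explanatory comments:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'outcome': ['yes', 'yes', 'yes', 'no', 'yes', 'no']})

# Encode each column as integer codes plus its array of unique labels
fi, ui = pd.factorize(df.ID.values)       # codes for ID, uniques ui
fo, uo = pd.factorize(df.outcome.values)  # codes for outcome, uniques uo

n, m = ui.size, uo.size
# fi * m + fo flattens the (ID, outcome) pair into one index in [0, n*m);
# bincount tallies each flat index, and reshape recovers the n x m grid
counts = np.bincount(fi * m + fo, minlength=n * m).reshape(n, m)
result = pd.DataFrame(counts,
                      pd.Index(ui, name='ID'),
                      pd.Index(uo, name='outcome'))
print(result)
```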

Option C

pd.get_dummies(df.ID).T.dot(pd.get_dummies(df.outcome))

   no  yes
1   0    3
2   2    1
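
Option C works because get_dummies one-hot encodes each column, and the transpose-dot is a matrix product whose (i, o) entry counts rows where ID == i and outcome == o. A runnable sketch; the astype(int) casts are added here because newer pandas returns boolean dummies, and a dot product of booleans would not sum:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'outcome': ['yes', 'yes', 'yes', 'no', 'yes', 'no']})

# One-hot encode both columns (cast bool dummies to int so dot sums them);
# the matrix product tallies co-occurrences of each (ID, outcome) pair
result = (pd.get_dummies(df.ID).astype(int).T
            .dot(pd.get_dummies(df.outcome).astype(int)))
print(result)
```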

Option IV.

df.groupby(['ID', 'outcome']).size().unstack(fill_value=0)
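
A self-contained version of Option IV, which is arguably the most readable of the four:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'outcome': ['yes', 'yes', 'yes', 'no', 'yes', 'no']})

# Count rows per (ID, outcome) pair, then pivot outcomes into columns;
# fill_value=0 covers pairs that never occur (e.g. ID 1 with 'no')
result = df.groupby(['ID', 'outcome']).size().unstack(fill_value=0)
print(result)
```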
piRSquared

Group on the ID column, then aggregate with value_counts on the outcome column. This returns a series, so convert it back to a dataframe with .to_frame() so that you can unstack the yes/no values (i.e. have them as columns), and fill the missing combinations with zero via fill_value=0.

df_total = df.groupby('ID')['outcome'].value_counts().to_frame().unstack(fill_value=0)
df_total.columns = df_total.columns.droplevel()
>>> df_total
outcome  no  yes
ID              
1         0    3
2         2    1
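The same two steps, runnable end to end; the droplevel call strips the outer column level that unstacking a one-column frame leaves behind:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'outcome': ['yes', 'yes', 'yes', 'no', 'yes', 'no']})

# Count outcomes per ID, pivot them into columns, fill missing pairs with 0
df_total = (df.groupby('ID')['outcome']
              .value_counts()
              .to_frame()
              .unstack(fill_value=0))
df_total.columns = df_total.columns.droplevel()  # drop outer column level
print(df_total)
```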
Alexander

Use set_index and pd.concat

df1 = df.set_index('ID')
pd.concat([df1.outcome.eq('yes').groupby(level=0).sum(),
           df1.outcome.ne('yes').groupby(level=0).sum()],
          keys=['yes', 'no'], axis=1).reset_index()

Output:

   ID  yes  no
0   1    3   0
1   2    1   2
Scott Boston

A simple setup that avoids bugs and takes advantage of fast vectorized functions: create a boolean dummy column for each outcome, then sum them within each group:

df['dummy_yes'] = df.outcome == 'yes'
df['dummy_no'] = df.outcome == 'no'

df.groupby('ID')[['dummy_yes', 'dummy_no']].sum()
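
A runnable version of the above; the indicator columns are selected explicitly because summing the string outcome column raises an error on current pandas:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'outcome': ['yes', 'yes', 'yes', 'no', 'yes', 'no']})

# Boolean indicator columns: True/False sum as 1/0 under groupby
df['dummy_yes'] = df.outcome == 'yes'
df['dummy_no'] = df.outcome == 'no'

# Select only the indicator columns so the string column isn't summed
result = df.groupby('ID')[['dummy_yes', 'dummy_no']].sum()
print(result)
```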
ℕʘʘḆḽḘ