
I have a dataframe that is a list of organisations (multiple rows per organisation), where each row has a boolean value associated with it.

So organisation A has 3 true rows and 3 false ones, i.e.

Organisation  Value
A             True
A             True
A             False
A             False
A             True
A             False
B             True
B             True
C             False
C             False

I want to get it into the format where I have each organisation listed only once, and the true and false values under the value column as new column headers like this (and then counted):

Organisation  True   False
A              3       3
B              2       0
C              0       2

Currently my code is like this:

sqlDf.groupby(['Organisation','Value']).size().reset_index(name='counts')

But that doesn't do what I need it to do.

Any advice?

Nicholas

2 Answers


You can add unstack with the parameter fill_value=0; also, if the ordering of values is important, add sort=False to groupby:

df = df.groupby(['Organisation','Value'], sort=False).size().unstack(fill_value=0)
print (df)
Value         True   False
Organisation              
A                 3      3
B                 2      0
C                 0      2

df = (
      df.groupby(['Organisation','Value'], sort=False)
        .size()
        .unstack(fill_value=0)
        .reset_index()
        .rename_axis(None, axis=1)
      )
print (df)
  Organisation  True  False
0            A     3      3
1            B     2      0
2            C     0      2
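For reference, a reproducible setup for the sample data above (column names taken from the question), showing the full chain end to end:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Organisation': list('AAAAAABBCC'),
    'Value': [True, True, False, False, True,
              False, True, True, False, False],
})

# Count rows per (Organisation, Value) pair, then spread Value into columns
out = (
    df.groupby(['Organisation', 'Value'], sort=False)
      .size()
      .unstack(fill_value=0)
      .reset_index()
      .rename_axis(None, axis=1)
)
print(out)
```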
jezrael

Option 1
pd.crosstab -

pd.crosstab(df.Organisation, df.Value)

Value         False  True 
Organisation              
A                 3      3
B                 0      2
C                 2      0
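As a small extension beyond what the question asks, crosstab can also append row and column totals via margins=True:

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'Organisation': list('AAAAAABBCC'),
    'Value': [True, True, False, False, True,
              False, True, True, False, False],
})

# margins=True adds an 'All' row and column holding the totals
ct = pd.crosstab(df.Organisation, df.Value, margins=True)
print(ct)
```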

Option 2
Tensor dot products with get_dummies -

df.Organisation.str.get_dummies().T.dot(pd.get_dummies(df.Value))

   False  True 
A      3      3
B      0      2
C      2      0

Credit to @piRSquared, picked this neat trick up from him a while back, and have used it in a few answers to date.
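To see why the dot product counts co-occurrences: each one-hot matrix has a 1 where a row belongs to that organisation (or carries that value), so the matrix product sums, per organisation, how many rows fall under each value. A sketch of the intermediates with the question's sample data (the explicit astype(int) guards against newer pandas versions where get_dummies returns booleans):

```python
import pandas as pd

df = pd.DataFrame({
    'Organisation': list('AAAAAABBCC'),
    'Value': [True, True, False, False, True,
              False, True, True, False, False],
})

orgs = df.Organisation.str.get_dummies().astype(int)  # (n_rows, n_orgs) indicators
vals = pd.get_dummies(df.Value).astype(int)           # (n_rows, 2) indicators
counts = orgs.T.dot(vals)                             # (n_orgs, 2) co-occurrence counts
print(counts)
```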


Option 3
get_dummies + sum along the index -

pd.get_dummies(df.set_index('Organisation').Value).sum(level=0)

              False  True 
Organisation              
A                 3      3
B                 0      2
C                 2      0
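Note that in recent pandas versions the level keyword on sum has been removed; the equivalent groupby spelling, with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Organisation': list('AAAAAABBCC'),
    'Value': [True, True, False, False, True,
              False, True, True, False, False],
})

# groupby(level=0).sum() replaces the removed sum(level=0) keyword
out = (pd.get_dummies(df.set_index('Organisation').Value)
         .groupby(level=0)
         .sum())
print(out)
```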

Option 4
Similar to jezrael's, but uses value_counts and only groups on one column, so it should be performant.

df.groupby('Organisation').Value.value_counts().unstack(fill_value=0)

Value         False  True 
Organisation              
A                 3      3
B                 0      2
C                 2      0

Option 5
pivot_table

df.pivot_table(index='Organisation', columns='Value', aggfunc='size', fill_value=0)

Value         False  True 
Organisation              
A                 3      3
B                 0      2
C                 2      0

In this case, pivot_table lets you specify only two params, the index and columns, since you're aggregating on the number of values per cell.

cs95
  • Thank you Coldspeed! Your answer works perfectly too! Now I am stuck having to choose between yours and jezrael's answer as the accepted one, as they both work perfectly. I guess yours is shorter! – Nicholas Jan 23 '18 at 13:44
  • @ScoutEU Added option 4, just fyi ;-) – cs95 Jan 23 '18 at 13:57
  • Hmmm, so what solution is the fastest? – jezrael Jan 23 '18 at 13:58
  • @jezrael Ok. I have 5 options. Which is similar? Third? Last? – cs95 Jan 23 '18 at 14:03
  • @jezrael Hmm, okay, I'll sign it then. Since we both ended up answering this one, let's excuse each other :) – cs95 Jan 23 '18 at 14:04
  • Yes, I answered because of a small difference with `sort=False` in groupby, but now I see it is not important there ;) So it happens ;) Good luck! – jezrael Jan 23 '18 at 14:08