Create a column of counts in a pandas dataframe

Question

I want to create a column of counts in a pandas dataframe. Here is the input:

dict = {'id': [1,2,3,4,5,6], 'cat': ['A', 'A', 'A', 'A', 'A', 'B'], 'status': [1, 1, 1, 1, 2, 1]}

  id cat status
0 1  A   1
1 2  A   1
2 3  A   1
3 4  A   1
4 5  A   2
5 6  B   1

Preferred output:

  id cat status status_1_for_cat_count status_2_for_category_count
0 1  A   1      4                      1
1 2  A   1      4                      1
2 3  A   1      4                      1
3 4  A   1      4                      1
4 5  A   2      4                      1
5 6  B   1      1                      0

As can hopefully be seen, I'm trying to get the full counts added for each row to two columns (one for each status). I have tried several approaches, mostly with groupby in combination with unique_counts, transform, apply, filter, merges and what not, but have not been able to get this to work. I am able to do this on a single column easily (I want to create a column of value_counts in my pandas dataframe), but not with two different statuses combined with the category.

Status is different in dict and the output? – Bharath M Shetty Jan 07 '18 at 04:46 — Bharath M Shetty, Jan 07 '18 at 04:46

Bharath M Shetty · Answer 1 · 2018-01-07T05:09:40.087

4

You can use get_dummies first then groupby transform i.e

one = pd.get_dummies(df.set_index(['id','cat']).astype(str))
two = one.groupby(['cat']).transform('sum').reset_index()

   id cat  status_1  status_2
0   1   A         4         1
1   2   A         4         1
2   3   A         4         1
3   4   A         4         1
4   5   A         4         1
5   6   B         1         0

edited Jan 07 '18 at 05:09

answered Jan 07 '18 at 04:52

Bharath M Shetty

30,075
6
57
108

1

You can compress this a bit: `pd.get_dummies(df.set_index('cat').status).add_prefix('status_').groupby(level=0).transform('sum')` – cs95 Jan 07 '18 at 05:20
Hey @Dark, this answer worked well on the sample DataFrame but caused a Memory Error on the full DataFrame (its a rather large dataset with many categories and in that case get_dummies uses a fair amount of memory; it used about 6GB before it crashed). It was also more CPU intense. I didn't have those two issues with Psidom crosstab answer. It's possible that this answer works better for other use cases though. Thanks a lot! :-) – Mathijs de Jong Jan 07 '18 at 05:24
@MathijsdeJong you know get_dummies was supposed to be faster, I never got into memory error till today. Glad you got it done. And if the answer was useful upvote and support :) – Bharath M Shetty Jan 07 '18 at 05:28
adding a new way for using get_dummies :-) – BENY Jan 07 '18 at 17:05

score 4 · Accepted Answer · answered Jan 07 '18 at 04:56

4

Another option, use pd.crosstab to create a two way table with cat as index, then join back with the original data frame on cat column:

df.join(pd.crosstab(df.cat, 'status_' + df.status.astype(str)), on='cat')

#  cat  id  status  status_1  status_2
#0   A   1       1         4         1
#1   A   2       1         4         1
#2   A   3       1         4         1
#3   A   4       1         4         1
#4   A   5       2         4         1
#5   B   6       1         1         0

answered Jan 07 '18 at 04:56

Psidom

209,562
33
339
356

I certainly didn't think about crosstab – Bharath M Shetty Jan 07 '18 at 04:58
1

@Dark It's a little convenient function I use quite often. I like the use of `get_dummies` here. – Psidom Jan 07 '18 at 05:01
This worked great! Was very quick and resource friendly. I had never thought to use crosstab. :-) – Mathijs de Jong Jan 07 '18 at 05:25

Create a column of counts in a pandas dataframe

2 Answers2