1

I want to create a column of counts in a pandas dataframe. Here is the input:

dict = {'id': [1,2,3,4,5,6], 'cat': ['A', 'A', 'A', 'A', 'A', 'B'], 'status': [1, 1, 1, 1, 2, 1]}

  id cat status
0 1  A   1
1 2  A   1
2 3  A   1
3 4  A   1
4 5  A   2
5 6  B   1 

Preferred output:

  id cat status status_1_for_cat_count status_2_for_category_count
0 1  A   1      4                      1
1 2  A   1      4                      1
2 3  A   1      4                      1
3 4  A   1      4                      1
4 5  A   2      4                      1
5 6  B   1      1                      0

As can hopefully be seen, I'm trying to get the full counts added for each row to two columns (one for each status). I have tried several approaches, mostly with groupby in combination with unique_counts, transform, apply, filter, merges and what not, but have not been able to get this to work. I am able to do this on a single column easily (I want to create a column of value_counts in my pandas dataframe), but not with two different statuses combined with the category.

2 Answers2

4

You can use get_dummies first then groupby transform i.e

one = pd.get_dummies(df.set_index(['id','cat']).astype(str))
two = one.groupby(['cat']).transform('sum').reset_index()

   id cat  status_1  status_2
0   1   A         4         1
1   2   A         4         1
2   3   A         4         1
3   4   A         4         1
4   5   A         4         1
5   6   B         1         0
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
  • 1
    You can compress this a bit: `pd.get_dummies(df.set_index('cat').status).add_prefix('status_').groupby(level=0).transform('sum')` – cs95 Jan 07 '18 at 05:20
  • Hey @Dark, this answer worked well on the sample DataFrame but caused a Memory Error on the full DataFrame (its a rather large dataset with many categories and in that case get_dummies uses a fair amount of memory; it used about 6GB before it crashed). It was also more CPU intense. I didn't have those two issues with Psidom crosstab answer. It's possible that this answer works better for other use cases though. Thanks a lot! :-) – Mathijs de Jong Jan 07 '18 at 05:24
  • @MathijsdeJong you know get_dummies was supposed to be faster, I never got into memory error till today. Glad you got it done. And if the answer was useful upvote and support :) – Bharath M Shetty Jan 07 '18 at 05:28
  • adding a new way for using get_dummies :-) – BENY Jan 07 '18 at 17:05
4

Another option, use pd.crosstab to create a two way table with cat as index, then join back with the original data frame on cat column:

df.join(pd.crosstab(df.cat, 'status_' + df.status.astype(str)), on='cat')

#  cat  id  status  status_1  status_2
#0   A   1       1         4         1
#1   A   2       1         4         1
#2   A   3       1         4         1
#3   A   4       1         4         1
#4   A   5       2         4         1
#5   B   6       1         1         0
Psidom
  • 209,562
  • 33
  • 339
  • 356