1

I have a pandas dataframe with some columns in it. The column I am interested in is something like this,

df['col'] = ['A', 'A', 'B', 'C', 'B', 'A']

I want to make another column say, col_count such that it shows count value in col from that index to the end of the column.

The first A in the column should have a value 3 because there is 3 occurrence of A in the column from that index. The second A will have value 2 and so on.

Finally, I want to get the following result,

    col     col_count
0   A   3
1   A   2
2   B   2
3   C   1
4   B   1
5   A   1

How can I do this effectively in pandas.? I was able to do this by looping through the dataframe and taking a unique count of that value for a sliced dataframe.

Is there an efficient method to do this.? Something without loops preferable.

Another part of the question is, I have another column like this along with col,

df['X'] = [10, 40, 10, 50, 30, 20]

I want to sum up this column in the same fashion I wanted to count the column col.

For instance, At index 0, I will have 10 + 40 + 20 as the sum. At index 1, the sum will be 40 + 20. In short, instead of counting, I want to sum up another column.

The result will be like this,

    col     col_count   X   X_sum
0   A   3   10  70
1   A   2   40  60
2   B   2   10  40
3   C   1   50  50
4   B   1   30  30
5   A   1   20  20
Sreeram TP
  • 11,346
  • 7
  • 54
  • 108
  • Possible duplicate of [what is the most efficient way of counting occurrences in pandas?](https://stackoverflow.com/questions/20076195/what-is-the-most-efficient-way-of-counting-occurrences-in-pandas) – Deepak Chauhan Jul 31 '19 at 05:24
  • This is not a duplicate, Actually the other question is not at all related to this one – Sreeram TP Jul 31 '19 at 05:36

1 Answers1

2

Use pandas.Series.groupby with cumcount and cumsum.

g = df[::-1].groupby('col')
df['col_count'] = g.cumcount().add(1)
df['X_sum'] = g['X'].cumsum()
print(df)

Output:

  col   X  col_count  X_sum
0   A  10          3     70
1   A  40          2     60
2   B  10          2     40
3   C  50          1     50
4   B  30          1     30
5   A  20          1     20
Chris
  • 29,127
  • 3
  • 28
  • 51
  • This is not the result I am looking for – Sreeram TP Jul 31 '19 at 05:08
  • Looks good, Let me test for some cases. Also I have a part 2 for this question I will update for that – Sreeram TP Jul 31 '19 at 05:15
  • @SreeramTP Updated. Note that `cumcount` has also been changed. Also, it is not so recommended to post 2 part question. Reading [this post](https://meta.stackexchange.com/questions/39223/one-post-with-multiple-questions-or-multiple-posts) may be helpful. – Chris Jul 31 '19 at 05:45
  • Yeah sure. I thought since both are similar cases I will put it in same Q – Sreeram TP Jul 31 '19 at 05:46
  • What is the relevance of [::-1]. Why did u reverse the order.? – Sreeram TP Jul 31 '19 at 05:47
  • 1
    Because the desired output of both `cumcount` and `cumsum` are in descending order. `cumcount` by default supports descending order of count but `cumsum` doesn't. Instead of applying `[::-1]` to `cumsum` only, applying it to the entire dataset grants you consistency in the use of both method. – Chris Jul 31 '19 at 05:50