Pandas rolling mean in subset of dataframe based on category

Question

I have a pandas dataframe with a numeric (Col1) and a categorical (Col2) column.

	Col1	Col2
0	0	B
1	1	B
2	2	B
3	3.7	C
4	4	C
5	5	B
6	6	B
7	7	B
8	8	A
9	9	B

For each row of that dataframe, I need to take the mean of N previous elements in the numeric column (Col1) where the values in the categorical column (Col2) are equal to the row's value in that column.

If fewer than N previous elements are available, the mean should be taken over those that are available; if the category has not been encountered yet, the result should be zero. For example, in the row with id = 3 we have not come across the 'C' category yet, so the mean is zero. In the row with id = 4, however, there is only one earlier 'C' observation, so regardless of N the mean is 3.7.

I can implement this by iterating through the rows, but is there a more efficient way to achieve this using pandas?
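
For reference, the row-by-row version could look roughly like the sketch below. The function name `prior_category_mean_loop` and the `window` parameter (the number of earlier same-category rows to average) are illustrative names, not part of pandas.

```python
from collections import defaultdict, deque

import pandas as pd


def prior_category_mean_loop(df, window):
    """Mean of up to `window` earlier Col1 values that share the row's Col2 value."""
    # One bounded history per category; deque(maxlen=...) drops the oldest value.
    history = defaultdict(lambda: deque(maxlen=window))
    means = []
    for value, category in zip(df["Col1"], df["Col2"]):
        seen = history[category]
        # Zero when the category has not been encountered yet.
        means.append(sum(seen) / len(seen) if seen else 0.0)
        # Record the current value only after computing its mean.
        seen.append(value)
    return pd.Series(means, index=df.index, name="mean")


df = pd.DataFrame({
    "Col1": [0, 1, 2, 3.7, 4, 5, 6, 7, 8, 9],
    "Col2": list("BBBCCBBBAB"),
})
result = prior_category_mean_loop(df, window=3)
```

With `window=3`, this gives 0 for the row with id = 3 (first 'C') and 3.7 for the row with id = 4, as described above.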

EDIT
The suggested solution

N = 3
df.groupby('Col2')['Col1'].transform(lambda g: g.rolling(N-1).mean().shift())

doesn't quite produce what I need. Consider this example:

	Col1	Col2	mean
0	1	B	nan
1	2	B	nan
2	3.7	C	nan
3	4	A	nan
4	5	B	1.5

For N = 3 and id = 1, the mean of the N-1 = 2 previous "B" elements cannot be computed, because there is only one "B" element before that row. In such a case, however, I need the mean of the available n < N-1 elements. In this example, that is just mean([1]) = 1.
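
One possible way to get this behaviour, sketched under a few assumptions: shift within each group first so the current row is excluded, then use `rolling(..., min_periods=1)` so that shorter histories are still averaged, and fill the remaining NaN (no earlier row of that category) with zero. The function name `prior_category_mean` is hypothetical, and `window` should be set to N or N-1 depending on which definition is intended.

```python
import pandas as pd


def prior_category_mean(df, window):
    """Per-category mean of up to `window` previous Col1 values; 0 if none exist."""
    return (
        df.groupby("Col2")["Col1"]
        # shift() drops the current row from its own window;
        # min_periods=1 averages whatever earlier values are available.
        .transform(lambda g: g.shift().rolling(window, min_periods=1).mean())
        # NaN remains only where the category has no earlier row.
        .fillna(0)
    )


df = pd.DataFrame({"Col1": [1, 2, 3.7, 4, 5], "Col2": list("BBCAB")})
df["mean"] = prior_category_mean(df, window=2)
```

On the example above with `window=2`, the row with id = 1 gets mean([1]) = 1 and the row with id = 4 gets 1.5. Note that `fillna(0)` would also turn a window consisting only of missing `Col1` values into zero, which may or may not be desired.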

`df.groupby('Col2')['Col1'].rolling(N).mean()` and for the shifted rolling mean `df.groupby('Col2')['Col1'].transform(lambda g: g.rolling(N-1).mean().shift())` — mozway, Aug 21 '23 at 07:58
@mozway, it's quite close but not exactly what I need. I've made the edit — almostdeitascientist, Aug 21 '23 at 08:25
`np.random.seed(6); df = pd.DataFrame({'Col1': [1, 2, 3.7, 4, 5, 6, 7, 8, 9, 10], 'Col2': np.random.choice(['A', 'B', 'B', 'C'], size=10)})` — almostdeitascientist, Aug 21 '23 at 09:10
