
I have a pandas dataframe with a numeric (Col1) and a categorical (Col2) column.

   Col1  Col2
0  0     B
1  1     B
2  2     B
3  3.7   C
4  4     C
5  5     B
6  6     B
7  7     B
8  8     A
9  9     B

For each row of that dataframe, I need the mean of the N previous values in the numeric column (Col1), taken only from earlier rows whose value in the categorical column (Col2) equals the current row's value in that column.

If fewer than N previous elements are available, the mean should be taken over the available ones, and it should be zero if that category has not been encountered yet. For example, in the row with id = 3, we have not come across the 'C' category yet, so the mean is zero. In the row with id = 4, however, there is only one earlier 'C' observation, so regardless of N, the mean is 3.7.
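For reference, the row-by-row version described above can be sketched as follows. This is only a minimal illustration of the intended behaviour, not an efficient solution; the function name `prev_group_mean_loop` and its parameters are made up for this example, and it assumes "N previous elements" means at most N earlier rows of the same category.

```python
from collections import defaultdict, deque

import pandas as pd


def prev_group_mean_loop(df, n, value_col="Col1", group_col="Col2"):
    """Mean of up to n previous values sharing the row's category; 0 if none."""
    # Hypothetical helper: keeps only the last n values seen per category.
    history = defaultdict(lambda: deque(maxlen=n))
    out = []
    for value, key in zip(df[value_col], df[group_col]):
        seen = history[key]
        # Zero when the category has not been encountered yet.
        out.append(sum(seen) / len(seen) if seen else 0.0)
        # Only after computing the mean does the current row become "previous".
        seen.append(value)
    return pd.Series(out, index=df.index, name="mean")


df = pd.DataFrame({
    "Col1": [0, 1, 2, 3.7, 4, 5, 6, 7, 8, 9],
    "Col2": list("BBBCCBBBAB"),
})
df["mean"] = prev_group_mean_loop(df, n=3)
```

With this data, the row with id = 3 gets 0 and the row with id = 4 gets 3.7, matching the description above.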

I can implement this by iterating through the rows, but is there a more efficient way to achieve this using pandas?

EDIT
The suggested solution

N = 3
df.groupby('Col2')['Col1'].transform(lambda g: g.rolling(N-1).mean().shift())

doesn't quite produce what I need. Consider this example:

   Col1  Col2  mean
0  1     B     nan
1  2     B     nan
2  3.7   C     nan
3  4     A     nan
4  5     B     1.5

For N = 3 and id = 1, the mean of the N-1 = 2 previous "B" elements cannot be computed because only one "B" element precedes that row, so the result is nan. In such a case, however, I need the mean of the available n < N-1 elements. In this example, that is just mean([1]) = 1.
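One direction that appears to give this behaviour is to let the rolling window accept partial windows through `min_periods=1` and to fill the remaining gaps (rows where the category has no earlier observation) with zero. The sketch below is an untested-on-real-data illustration, not a confirmed answer; the helper name `prev_group_mean` and the `window` parameter are made up here, and `window` should be set to N or N-1 depending on which convention is intended.

```python
import pandas as pd


def prev_group_mean(df, window, value_col="Col1", group_col="Col2"):
    """Per-category mean of up to `window` previous values; 0 if none exist."""
    return (
        df.groupby(group_col)[value_col]
        # min_periods=1 allows a mean over fewer than `window` values;
        # shift() moves each result down one row within the group so that
        # the current row is excluded from its own mean.
        .transform(lambda g: g.rolling(window, min_periods=1).mean().shift())
        # The first row of each category has nothing before it.
        # Caveat: this would also zero out gaps caused by NaN values in Col1.
        .fillna(0)
    )


example = pd.DataFrame({
    "Col1": [1, 2, 3.7, 4, 5],
    "Col2": ["B", "B", "C", "A", "B"],
})
# window = N - 1 = 2, following the convention of the suggested solution
example["mean"] = prev_group_mean(example, window=2)
```

For the example above, the row with id = 1 becomes mean([1]) = 1 instead of nan, and the first row of each category becomes 0.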
