Groupby and find mean of a column rowwise subsetting that row python

Question

I have a problem in Python (pandas). This is my sample data:

     col1  col2  desired
0    a     1     2.50
1    a     2     2.00
2    a     3     1.50
3    b     2     3.00
4    b     3     2.00
5    c     3     1.67
6    c     1     2.33
7    c     2     2.00
8    c     2     2.00

The inputs are df['col1'] and df['col2']. I want to use these two columns to produce the desired outcome in df['desired'].

The idea is that I want to group by col1 and calculate the average value of col2. The only tweak is that I want to exclude the current row from the average calculation.

So for row 0, I am grouping by df['col1'] == 'a', but only use rows 1 and 2 to calculate the average. For row 1, I also group by df['col1'] == 'a', but I only use rows 0 and 2. And so forth.

The only thing I can think of is to create a custom function for .transform() that takes as input the series coming from the grouped object, but I am not sure how to approach it. Ideally, I am looking for a simpler (pandas?) method to achieve this.
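
For reference, the custom-function idea mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the frame is rebuilt from the sample data, and the column name `loo_mean` is a hypothetical name, not something from the question.

```python
import pandas as pd

# Sample data from the question (without the 'desired' column)
df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'col2': [1, 2, 3, 2, 3, 3, 1, 2, 2],
})

# For each group, s is the col2 Series of that group.
# (s.sum() - s) is the sum of the other rows; len(s) - 1 is how many there are.
# 'loo_mean' is a hypothetical column name used only for this sketch.
df['loo_mean'] = df.groupby('col1')['col2'].transform(
    lambda s: (s.sum() - s) / (len(s) - 1)
)
print(df)
```

The lambda runs once per group, so this may be slower than built-in aggregations when there are many groups.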

saw the edit. yes that example is precisely what i meant. corresponding to each row basically subset it from the grouped mean calculation. — Varun Rajan, Dec 18 '18 at 14:58
https://stackoverflow.com/questions/30274561/pandas-aggregating-average-while-excluding-current-row — BENY, Dec 18 '18 at 15:04
had not found the above one. must have been using the wrong key words. but yes this is a duplicate — Varun Rajan, Dec 18 '18 at 15:08

score 3 · Accepted Answer · answered Dec 18 '18 at 14:56

This solution works from the definition of the mean: sum / count.

First get the group size with transform and subtract 1 to exclude the current row. Do the same with the group sum, subtracting the current row's value. Finally, divide and assign the result to a new column:

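# Group size minus 1: the number of other rows in the same col1 group
a = df.groupby('col1')['col2'].transform('size').sub(1)
# Group sum minus the row's own value: the sum of the other rows
b = df.groupby('col1')['col2'].transform('sum').sub(df['col2'])

# Mean of the other rows in the group
df['des'] = b / a
print(df)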
  col1  col2  desired       des
0    a     1     2.50  2.500000
1    a     2     2.00  2.000000
2    a     3     1.50  1.500000
3    b     2     3.00  3.000000
4    b     3     2.00  2.000000
5    c     3     1.67  1.666667
6    c     1     2.33  2.333333
7    c     2     2.00  2.000000
8    c     2     2.00  2.000000
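
A self-contained version of the same sum/count approach, as a minimal sketch: the frame is rebuilt from the question's sample data, and the helper name `leave_one_out_mean` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Sample data from the question (without the 'desired' column)
df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'col2': [1, 2, 3, 2, 3, 3, 1, 2, 2],
})

def leave_one_out_mean(frame, key, value):
    """Mean of `value` within each `key` group, excluding the current row.

    Hypothetical helper for this sketch; not part of pandas.
    """
    grouped = frame.groupby(key)[value]
    others_count = grouped.transform('size') - 1          # rows in group, minus this one
    others_sum = grouped.transform('sum') - frame[value]  # group total, minus this row
    # A group with a single row has no other rows: 0 / 0 gives NaN
    return others_sum / others_count

df['des'] = leave_one_out_mean(df, 'col1', 'col2')
print(df)
```

Note that a group containing only one row produces NaN, since there are no other rows to average.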

more concise than the answer in the duplicate question. thanks! — Varun Rajan, Dec 18 '18 at 15:15

score 0 · Answer 2 · answered Dec 18 '18 at 15:22

Another option is filtering the selected row:

df['desired'] = df.apply(lambda x: df[~df.index.isin([x.name])].groupby('col1')['col2'].mean().loc[x['col1']], axis=1)
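
The same row-filtering idea can be written more readably as a named function. This is a sketch, assuming the question's sample data; `mean_without_row` and the column `des_filter` are hypothetical names used only here. Instead of re-running a groupby per row, it drops the current row and averages the remaining rows of the same group.

```python
import pandas as pd

# Sample data from the question (without the 'desired' column)
df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'col2': [1, 2, 3, 2, 3, 3, 1, 2, 2],
})

def mean_without_row(row):
    # Drop the current row by its index label ...
    others = df.drop(index=row.name)
    # ... then average col2 over the remaining rows with the same col1 value
    return others.loc[others['col1'] == row['col1'], 'col2'].mean()

# 'des_filter' is a hypothetical column name for this sketch
df['des_filter'] = df.apply(mean_without_row, axis=1)
print(df)
```

Because this filters the whole frame once per row, it scales poorly on large frames compared with the transform-based approach.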

Tarifazo
