6

How to aggregate in the way to get the average of b for group a, while excluding the current row (the target result is in c)?

a b   c

1 1   0.5   # (avg of 0 & 1, excluding 1)
1 1   0.5   # (avg of 0 & 1, excluding 1)
1 0   1     # (avg of 1 & 1, excluding 0)

2 1   0.5   # (avg of 0 & 1, excluding 1)
2 0   1     # (avg of 1 & 1, excluding 0)
2 1   0.5   # (avg of 0 & 1, excluding 1)

3 1   0.5   # (avg of 0 & 1, excluding 1)
3 0   1     # (avg of 1 & 1, excluding 0)
3 1   0.5   # (avg of 0 & 1, excluding 1)

Data dump:

import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1], 
                     [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                     columns=['a', 'b', 'c'])
PascalVKooten
  • 20,643
  • 17
  • 103
  • 160

2 Answers2

9

Suppose a group has values x_1, ..., x_n.

The average of the entire group would be

m = (x_1 + ... + x_n)/n

The sum of the group without x_i would be

(m*n - x_i)

The average of the group without x_i would be

(m*n - x_i)/(n-1)

Therefore, you could compute the desired column of values with

import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1], 
                     [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                     columns=['a', 'b', 'c'])

grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)

which yields

In [32]: df
Out[32]: 
   a  b    c  result
0  1  1  0.5     0.5
1  1  1  0.5     0.5
2  1  0  1.0     1.0
3  2  1  0.5     0.5
4  2  0  1.0     1.0
5  2  1  0.5     0.5
6  3  1  0.5     0.5
7  3  0  1.0     1.0
8  3  1  0.5     0.5

In [33]: assert df['result'].equals(df['c'])

Per the comments below, in the OP's actual use case, the DataFrame's a column contains strings:

def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
            .view('|S{}'.format(strlen)))

N = 3*10**6
df = pd.DataFrame({'a':make_random_str_array(letters='ABCD', strlen=10, size=N),
                   'b':np.random.randint(10, size=N)})

so that there are about a million unique values in df['a'] out of 3 million total:

In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337

In [89]: len(df)
Out[89]: 3000000

In this case the calculation above requires (on my machine) about 11 seconds:

In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
   ....:    ....:    ....:    ....: 
1 loops, best of 3: 10.5 s per loop

Pandas converts all string-valued columns to object dtype. But we could convert the DataFrame column to a NumPy array with a fixed-width dtype, and the group according to those values.

Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with fixed-width string dtype, the calculation requires less than 2 seconds:

In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
   ....:    ....:    ....:    ....: 
1 loops, best of 3: 1.39 s per loop

Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 4, so |S4 works. If you use |Sn for some integer n and n is smaller than the longest string, then those strings will get silently truncated with no error warning. This could potentially lead to the grouping of values which should not be grouped together. Thus, the onus is on you to choose the correct fixed-width dtype.

You could use

dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))

to ensure the conversion uses the correct dtype.

Community
  • 1
  • 1
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Also, what would happen if there is only 1 case? I'd like to make that the average of all the other cases that do not fail. – PascalVKooten May 16 '15 at 10:43
  • So replace afterwards. – PascalVKooten May 16 '15 at 10:44
  • It's incredibly slow though (4 million rows). I'm thinking perhaps the better way is to just aggregate including the row, and then subtract, weighted by counts (your first comment helped with the idea)? – PascalVKooten May 16 '15 at 10:47
  • There is something different about your real use case that I'm not seeing. I've run the code above with `df = pd.concat([df]*1000000)` so that `len(df)` is 9 million, and `%timeit df.groupby(['a'])['b'].transform(ave_others)` completes in 1.18 seconds. – unutbu May 16 '15 at 11:17
  • Perhaps to do with the amount of unique groups? There are 1 million out of 3 million unique groups. – PascalVKooten May 16 '15 at 11:19
  • Yes, I was experimenting with that just now. With `df = pd.DataFrame(np.random.randint(10000, size=(N,2)), columns=['a','b'])`, with `N = 10**6`, I'm still getting `%timeit df.groupby(['a'])['b'].transform(ave_others)` to complete in 2.96 seconds. – unutbu May 16 '15 at 11:21
  • Absurd: when I do the exact same thing with a different 'groupby' variable, it also takes me 2.5 seconds. – PascalVKooten May 17 '15 at 08:15
  • What is the dtype of the problematic column? You can find the dtypes with `df.info()`. – unutbu May 17 '15 at 11:52
  • That was the first thing I checked afterwards: both are `dtype=object` – PascalVKooten May 17 '15 at 11:53
  • With `dtype=object`, grouping requires equality checking of the Python objects. This will be much slower than grouping when the dtype is a native NumPy dtype. – unutbu May 17 '15 at 11:55
  • I could cast them as `str`, would that improve the speed? – PascalVKooten May 17 '15 at 11:57
  • Yes, I believe converting the strings to a fixed-width dtype can improve speed. Since [Pandas converts all string-valued columns to `object` dtype](http://stackoverflow.com/a/21841473/190597), you can not store the fixed-width NumPy array in the DataFrame itself. But you can still use that NumPy array as the values by which to groupby: `grouped = df.groupby(df['a'].values.astype('|S4'))`. I've edited the post above to show some benchmarks. – unutbu May 17 '15 at 13:13
2

You can calculate the statistics manually by iterating group by group:

# Set up input
import pandas as pd
df = pd.DataFrame([
        [1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], 
        [2, 1, 0.5], [2, 0, 1], [2, 1, 0.5], 
        [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
    ], columns=['a', 'b', 'c'])
df
   a  b    c
0  1  1  0.5
1  1  1  0.5
2  1  0  1.0
3  2  1  0.5
4  2  0  1.0
5  2  1  0.5
6  3  1  0.5
7  3  0  1.0
8  3  1  0.5

# Perform grouping, excluding the current row
results = []
grouped = df.groupby(['a'])
for key, group in grouped:
    for idx, row in group.iterrows():
        # The group excluding current row
        group_other = group.drop(idx)  
        avg = group_other['b'].mean()
        results.append(row.tolist() + [avg])

# Compare our results with what is expected
results_df = pd.DataFrame(
    results, columns=['a', 'b', 'c', 'c_new']
)
results_df
   a  b    c  c_new
0  1  1  0.5    0.5
1  1  1  0.5    0.5
2  1  0  1.0    1.0
3  2  1  0.5    0.5
4  2  0  1.0    1.0
5  2  1  0.5    0.5
6  3  1  0.5    0.5
7  3  0  1.0    1.0
8  3  1  0.5    0.5

This way you can use any statistic you want.

ostrokach
  • 17,993
  • 11
  • 78
  • 90