
I am trying to downsample a pandas dataframe in order to reduce granularity. For example, I want to reduce this dataframe:

1  2  3  4
2  4  3  3
2  2  1  3
3  1  3  2

to this (downsampling to obtain a 2x2 dataframe using mean):

2.25  3.25
2     2.25

Is there a built-in or efficient way to do this, or do I have to write it on my own?

Thanks

gc5
  • How would you want to downsample? Take every two neighbouring rows and do a mean? – Viktor Kerkez Sep 16 '13 at 10:14
  • I want to slice the original matrix in submatrices and then do a block mean in each submatrix.. e.g. for element (1,1) in result matrix do the block mean of submatrix (1:2, 1:2) in original matrix.. – gc5 Sep 16 '13 at 10:17
  • I think that 2D filter kernel is called [2x2 Box blur](https://en.wikipedia.org/wiki/Kernel_(image_processing)) – smci May 11 '18 at 23:22

2 Answers


One option is to use groupby twice. Once for the index:

In [11]: df.groupby(lambda x: x//2).mean()
Out[11]:
     0    1  2    3
0  1.5  3.0  3  3.5
1  2.5  1.5  2  2.5

and once for the columns:

In [12]: df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
Out[12]:
      0     1
0  2.25  3.25
1  2.00  2.25

Note: A solution which only calculated the mean once might be preferable... one option is to stack, groupby, mean, and unstack, but atm this is a little fiddly.
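
A minimal sketch of that stack/groupby/unstack route (not from the answer above; the intermediate name s is my addition, and the //2 grouping keys mirror the ones used earlier), which computes each 2x2 block mean in a single pass:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

# Stack to a Series indexed by (row, col), group each cell by the 2x2
# block it falls in, take one mean per block, then unstack back to 2D.
s = df.stack()
res = s.groupby([s.index.get_level_values(0) // 2,
                 s.index.get_level_values(1) // 2]).mean().unstack()
# res is the 2x2 frame [[2.25, 3.25], [2.00, 2.25]] from the question.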

This seems significantly faster than Viktor's solution:

In [21]: df = pd.DataFrame(np.random.randn(100, 100))

In [22]: %timeit df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
1000 loops, best of 3: 1.64 ms per loop

In [23]: %timeit viktor()
1 loops, best of 3: 822 ms per loop

In fact, Viktor's solution crashes my (underpowered) laptop for larger DataFrames:

In [31]: df = pd.DataFrame(np.random.randn(1000, 1000))

In [32]: %timeit df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
10 loops, best of 3: 42.9 ms per loop

In [33]: %timeit viktor()
# crashes

As Viktor points out, this doesn't work with a non-integer index. If that's needed, you can store the original index and columns in temporary variables and assign them back afterwards:

df_index, df_cols, df.index, df.columns = df.index, df.columns, np.arange(len(df.index)), np.arange(len(df.columns))
res = df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
res.index, res.columns = df_index[::2], df_cols[::2]
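
If it helps, that bookkeeping could be wrapped into a small helper along these lines (a sketch, not from the answer; the downsample name and the separate row/column block sizes are my additions, following the same groupby(..., axis=1) chain as above):

import numpy as np
import pandas as pd

def downsample(df, rbs=2, cbs=2):
    """Block-mean downsample: each rbs x cbs block becomes one cell."""
    orig_index, orig_cols = df.index, df.columns
    tmp = df.copy()
    # Switch to positional integer labels so the // grouping lines up.
    tmp.index = np.arange(len(tmp.index))
    tmp.columns = np.arange(len(tmp.columns))
    res = (tmp.groupby(lambda x: x//rbs).mean()
              .groupby(lambda y: y//cbs, axis=1).mean())
    # Keep the first label of each block from the original axes.
    res.index, res.columns = orig_index[::rbs], orig_cols[::cbs]
    return res

For the 4x4 example, downsample(df, 2, 2) reproduces the 2x2 result while keeping every other original label.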
Andy Hayden

You can use the rolling_mean function applied twice, first on the columns and then on the rows, and then slice the results:

rbs = 2 # row block size
cbs = 2 # column block size
pd.rolling_mean(pd.rolling_mean(df.T, cbs, center=True)[cbs-1::cbs].T,
                rbs)[rbs-1::rbs]

This gives the result you want, except that the index will be different (you can fix that with .reset_index(drop=True)):

      1     3
1  2.25  3.25
3  2.00  2.25
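
(As an aside, pd.rolling_mean was removed in later pandas releases; a rough modern equivalent of the same trick, assuming the same df, uses the DataFrame.rolling method:)

rbs = 2  # row block size
cbs = 2  # column block size
# Rolling mean down the columns (via the transpose), keep every cbs-th
# row, then rolling mean down the rows and keep every rbs-th row.
res = (df.T.rolling(cbs, center=True).mean()[cbs-1::cbs].T
         .rolling(rbs).mean()[rbs-1::rbs])
res = res.reset_index(drop=True)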

Timing info:

In [11]: df = pd.DataFrame(np.random.randn(100, 100))
In [12]: %%timeit
         pd.rolling_mean(pd.rolling_mean(df.T, 2, center=True)[1::2].T, 2)[1::2]
100 loops, best of 3: 4.75 ms per loop
In [13]: %%timeit
         df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
100 loops, best of 3: 932 µs per loop

So it's around 5x slower than the groupby, not 800x :)

Viktor Kerkez
  • I am not sure this is what I need. The resulting matrix has to be 2x2, where each sample is the block mean between four elements of the original matrix. Your solution is interesting but I don't understand why the resulting matrix is 2x4.. – gc5 Sep 16 '13 at 10:53
  • @Francesco Ah sorry I misunderstood you, so you want a mean of every 2x2 submatrix? I calculated only row wise not column wise. – Viktor Kerkez Sep 16 '13 at 10:55
  • No problem :) Yes, assuming we have an original matrix of 4x4 elements, I want to obtain a matrix of 2x2 elements, where each element is the block mean of each 2x2 submatrix in the original matrix.. – gc5 Sep 16 '13 at 11:00
  • 1
    If you want, you can improve your answer by generalizing it, adding different sizes for rows and columns (not only bs for square matrices). Anyway, great answer :) – gc5 Sep 16 '13 at 11:27
  • 1
    This doesn't seem to scale well, e.g. try `df = pd.DataFrame(np.random.randn(100000, 100000))`. Even a smaller 1000x1000 crashes my interpreter. :( – Andy Hayden Sep 16 '13 at 14:36
  • 1
    @AndyHayden I tried it on a 10000x10000 matrix and it finished in 12s, but yes, groupby is around 4x faster (2.86s), though it only works when the column and row labels are integers. No other case will work. – Viktor Kerkez Sep 16 '13 at 14:48