
I am trying to downsample a pandas dataframe in order to reduce granularity. For example, I want to reduce this dataframe:

1  2  3  4
2  4  3  3
2  2  1  3
3  1  3  2

to this (downsampling to obtain a 2x2 dataframe using mean):

2.25  3.25
2     2.25

Is there a built-in or efficient way to do this, or do I have to write it on my own?

Thanks

gc5
  • How would you want to downsample? Take every two neighbouring rows and do a mean? – Viktor Kerkez Sep 16 '13 at 10:14
  • I want to slice the original matrix in submatrices and then do a block mean in each submatrix.. e.g. for element (1,1) in result matrix do the block mean of submatrix (1:2, 1:2) in original matrix.. – gc5 Sep 16 '13 at 10:17
  • I think that 2D filter kernel is called [2x2 Box blur](https://en.wikipedia.org/wiki/Kernel_(image_processing)) – smci May 11 '18 at 23:22

2 Answers


One option is to use groupby twice. Once for the index:

In [11]: df.groupby(lambda x: x//2).mean()
Out[11]:
     0    1  2    3
0  1.5  3.0  3  3.5
1  2.5  1.5  2  2.5

and once for the columns:

In [12]: df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
Out[12]:
      0     1
0  2.25  3.25
1  2.00  2.25

Note: A solution which only calculated the mean once might be preferable... one option is to stack, groupby, mean, and unstack, but atm this is a little fiddly.
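
A minimal sketch of that stack/groupby/unstack route (not from the answer above; the intermediate name s is my addition, and the //2 grouping keys mirror the ones used earlier), which computes each 2x2 block mean in a single pass:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

# Stack to a Series indexed by (row, col), group each cell by the 2x2
# block it falls in, take one mean per block, then unstack back to 2D.
s = df.stack()
res = s.groupby([s.index.get_level_values(0) // 2,
                 s.index.get_level_values(1) // 2]).mean().unstack()
# res is the 2x2 frame [[2.25, 3.25], [2.00, 2.25]] from the question.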

This seems significantly faster than Viktor's solution:

In [21]: df = pd.DataFrame(np.random.randn(100, 100))

In [22]: %timeit df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
1000 loops, best of 3: 1.64 ms per loop

In [23]: %timeit viktor()
1 loops, best of 3: 822 ms per loop

In fact, Viktor's solution crashes my (underpowered) laptop for larger DataFrames:

In [31]: df = pd.DataFrame(np.random.randn(1000, 1000))

In [32]: %timeit df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
10 loops, best of 3: 42.9 ms per loop

In [33]: %timeit viktor()
# crashes

As Viktor points out, this doesn't work with a non-integer index. If that's needed, you can store the original index and columns in temporary variables and assign them back afterwards:

df_index, df_cols, df.index, df.columns = df.index, df.columns, np.arange(len(df.index)), np.arange(len(df.columns))
res = df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
res.index, res.columns = df_index[::2], df_cols[::2]
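
If it helps, that bookkeeping could be wrapped into a small helper along these lines (a sketch, not from the answer; the downsample name and the separate row/column block sizes are my additions, following the same groupby(..., axis=1) chain as above):

import numpy as np
import pandas as pd

def downsample(df, rbs=2, cbs=2):
    """Block-mean downsample: each rbs x cbs block becomes one cell."""
    orig_index, orig_cols = df.index, df.columns
    tmp = df.copy()
    # Switch to positional integer labels so the // grouping lines up.
    tmp.index = np.arange(len(tmp.index))
    tmp.columns = np.arange(len(tmp.columns))
    res = (tmp.groupby(lambda x: x//rbs).mean()
              .groupby(lambda y: y//cbs, axis=1).mean())
    # Keep the first label of each block from the original axes.
    res.index, res.columns = orig_index[::rbs], orig_cols[::cbs]
    return res

For the 4x4 example, downsample(df, 2, 2) reproduces the 2x2 result while keeping every other original label.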
Andy Hayden

You can use the rolling_mean function applied twice, first on the columns and then on the rows, and then slice the results:

rbs = 2 # row block size
cbs = 2 # column block size
pd.rolling_mean(pd.rolling_mean(df.T, cbs, center=True)[cbs-1::cbs].T,
                rbs)[rbs-1::rbs]

This gives the result you want, except that the index will be different (you can fix that with .reset_index(drop=True)):

      1     3
1  2.25  3.25
3  2.00  2.25
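
(As an aside, pd.rolling_mean was removed in later pandas releases; a rough modern equivalent of the same trick, assuming the same df, uses the DataFrame.rolling method:)

rbs = 2  # row block size
cbs = 2  # column block size
# Rolling mean down the columns (via the transpose), keep every cbs-th
# row, then rolling mean down the rows and keep every rbs-th row.
res = (df.T.rolling(cbs, center=True).mean()[cbs-1::cbs].T
         .rolling(rbs).mean()[rbs-1::rbs])
res = res.reset_index(drop=True)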

Timing info:

In [11]: df = pd.DataFrame(np.random.randn(100, 100))
In [12]: %%timeit
         pd.rolling_mean(pd.rolling_mean(df.T, 2, center=True)[1::2].T, 2)[1::2]
100 loops, best of 3: 4.75 ms per loop
In [13]: %%timeit
         df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
100 loops, best of 3: 932 µs per loop

So it's around 5x slower than the groupby, not 800x :)

Viktor Kerkez
  • I am not sure this is what I need. The resulting matrix has to be 2x2, where each sample is the block mean between four elements of the original matrix. Your solution is interesting but I don't understand why the resulting matrix is 2x4.. – gc5 Sep 16 '13 at 10:53
  • @Francesco Ah sorry I misunderstood you, so you want a mean of every 2x2 submatrix? I calculated only row wise not column wise. – Viktor Kerkez Sep 16 '13 at 10:55
  • No problem :) Yes, assuming we have an original matrix of 4x4 elements, I want to obtain a matrix of 2x2 elements, where each element is the block mean of each 2x2 submatrix in the original matrix.. – gc5 Sep 16 '13 at 11:00
  • 1
    If you want, you can improve your answer by generalizing it, adding different sizes for rows and columns (not only bs for square matrices). Anyway, great answer :) – gc5 Sep 16 '13 at 11:27
  • 1
    This doesn't seem to scale well, e.g. try `df = pd.DataFrame(np.random.randn(100000, 100000))`. Even a smaller 1000x1000 crashes my interpreter. :( – Andy Hayden Sep 16 '13 at 14:36
  • 1
    @AndyHayden I tried it on a 10000x10000 matrix and it finished in 12s, but yes, groupby is around 4x faster (2.86s), though it only works when the column and row labels are integers. No other case will work. – Viktor Kerkez Sep 16 '13 at 14:48