7

Consider a pandas DataFrame which looks like the one below

      A     B     C
0  0.63  1.12  1.73
1  2.20 -2.16 -0.13
2  0.97 -0.68  1.09
3 -0.78 -1.22  0.96
4 -0.06 -0.02  2.18

I would like to use the function .rolling() to perform the following calculation for t = 0,1,2:

  • Select the rows from t to t+2
  • Take the 9 values contained in those 3 rows, from all the columns. Call this set S
  • Compute the 75th percentile of S (or other summary statistics about S)


For instance, for t = 1 we have S = { 2.2 , -2.16, -0.13, 0.97, -0.68, 1.09, -0.78, -1.22, 0.96 } and the 75th percentile is 0.97.

I couldn't find a way to make it work with .rolling(), since it apparently takes each column separately. I'm now relying on a for loop, but it is really slow.

Do you have any suggestion for a more efficient approach?

cs95
  • 379,657
  • 97
  • 704
  • 746
Abramodj
  • 5,709
  • 9
  • 49
  • 75
  • what about [this](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.rolling.html) – Rushabh Mehta Jul 31 '18 at 14:34
  • 5
    You have enough rep know that *not post the data as picture* – BENY Jul 31 '18 at 14:35
  • @RushabhMehta: that's exactly the function I am trying to use, but I don't understand how pull the data from the 3 columns A,B and C together to compute the percentile. – Abramodj Jul 31 '18 at 14:39
  • 1
    Just paste it as text and format it as code. Read this: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – user3483203 Jul 31 '18 at 14:45
  • I don't like the answers involving reshaping things. I used [`scipy.ndimage.generic_filter`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.generic_filter.html) to do this properly, though it needed some trial and error to get the same effect as Pandas' `rolling`'s `center=False` effect. – Ahmed Fasih Apr 19 '20 at 19:25

2 Answers2

4

One solution is to stack the data and then multiply your window size by the number of columns and slice the result by the number of columns. Also, since you want a forward looking window, reverse the order of the stacked DataFrame

wsize = 3
cols = len(df.columns)

df.stack(dropna=False)[::-1].rolling(window=wsize*cols).quantile(0.75)[cols-1::cols].reset_index(-1, drop=True).sort_index()

Output:

0    1.12
1    0.97
2    0.97
3     NaN
4     NaN
dtype: float64

In the case of many columns and a small window:

import pandas as pd
import numpy as np

wsize = 3
df2 = pd.concat([df.shift(-x) for x in range(wsize)], 1)
s_quant = df2.quantile(0.75, 1)

# Only necessary if you need to enforce sufficient data. 
s_quant[df2.isnull().any(1)] = np.NaN

Output: s_quant

0    1.12
1    0.97
2    0.97
3     NaN
4     NaN
Name: 0.75, dtype: float64
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • This is not equivalent, as it would compute the statistics many more times. This is because the window would move every 1 entry and not every 3. In my real dataset I have many more columns than 3, so this actually makes a huge difference computationally speaking. – Abramodj Jul 31 '18 at 15:10
  • @Abramodj I find it takes 8x longer for 1000x the number of columns. So as long as you have a reasonable number of columns, the excess probably wont kill you. It wasn't clear that you needed an extremely performant solution over just a solution to your problem. Let me see what I can do. – ALollz Jul 31 '18 at 15:23
  • Actually, I have more than 10k columns – Abramodj Jul 31 '18 at 15:27
  • About one thousand, so the for loop is actually feasible in this case. I just wanted to understand if this can be done, as it could be very useful in other circumstances. – Abramodj Jul 31 '18 at 15:31
  • 1
    @ALollz I think you should specify dropna=True in stack, otherwise the window size would be consistent – FLab Aug 15 '18 at 15:05
  • Yeah, that was a typo...definitely: stack(dropna=False) – FLab Aug 15 '18 at 15:32
0

You can use numpy ravel. Still you may have to use for loops.

for i in range(0,3):
    print(df.iloc[i:i+3].values.ravel())

If your t steps in 3s, you can use numpy reshape function to create a n*9 dataframe.

  • This is the solution I am using at the moment, which requires a for loop. It is not as fast as `rolling` because it is not vectorized – Abramodj Jul 31 '18 at 15:07
  • 2
    @Abramodj Have you tried this: `x = pd.concat([df, df.shift(-1),df.shift(-2)], axis=1)` then `x['m'] = x.mean(axis=1)`. For quartile you can use `x['q3'] = x.quantile(0.75, axis=1)`. – Dhanush1215 Jul 31 '18 at 15:22