3

I have a pandas pivot table that was previously shifted and now looks like this:

pivot
    A    B    C    D    E
0  5.3  5.1  3.5  4.2  4.5
1  5.3  4.1  3.5  4.2  NaN
2  4.3  4.1  3.5  NaN  NaN
3  4.3  4.1  NaN  NaN  NaN
4  4.3  NaN  NaN  NaN  NaN

I'm trying to calculate a rolling average with a variable window (in this case 3 and 4 periods) over the inverse diagonal iterating over every column and trying to store that value in a new dataframe, which would look like this:

expected_df with a 3 periods window
    A    B    C    D    E
0  4.3  4.1  3.5  4.2  4.5

expected_df with a 4 periods window
    A    B    C    D    E
0  4.5  4.3  3.5  4.2  4.5

So far, I tried to subset the original pivot table and create a different dataframe that only contains the specified window values for each column, to then calculate the average, like this:

subset
    A    B    C    D    E
0  4.3  4.1  3.5  4.2  4.5
1  4.3  4.1  3.5  4.2  NaN
2  4.3  4.1  3.5  NaN  NaN

For this, I tried to build the following for loop:

df2 = pd.DataFrame()
size = pivot.shape[0]
window = 3

for i in range(size): 
    df2[i] = pivot.iloc[size-window-i:size-i,i]

Which does not work even when this pivot.iloc[size-window-i:size-i,i] does return the values I need when I manually pass in the indexes, but in the for loop, it misses the first value of the second column and so on:

df2
    A    B    C    D    E
0  4.3  NaN  NaN  NaN  NaN
1  4.3  4.1  NaN  NaN  NaN
2  4.3  4.1  3.5  NaN  NaN

Does anyone have a good idea on how to calculate the moving average or on how to fix the for loop part? Thanks in advance for your comments.

AFVM
  • 33
  • 4
  • `np.diag(df.values[:,::-1])[::-1]`? – Divakar Feb 25 '19 at 16:10
  • Thanks Divakar, this is a really nice piece code to have but only gives me values of the inverse diagonal, what I really need is to calculate the average from that value upwards, the number of items averaged being the window variable. – AFVM Feb 25 '19 at 16:19
  • Add in the final expected o/p? – Divakar Feb 25 '19 at 16:20
  • I fail to see how your first result is the result of a rolling mean. Do you want to calculate the column-wise mean/average excluding the `nan`s? If yes, use `df.mean(axis=1, skipna=True)`. – Jan Christoph Terasa Feb 25 '19 at 16:21
  • @Divakar it's the expected_df – AFVM Feb 25 '19 at 16:29
  • @JanChristophTerasa i do want to calculate the column-wise mean/average but not of every value in the column since the data is seasonal and the complete average would not be precise in anyway, so `df.mean(axis=1, skipna=True)` I don't think would work – AFVM Feb 25 '19 at 16:31

1 Answers1

5

IIUC:

shift everything back

shifted = pd.concat([df.iloc[:, i].shift(i) for i in range(df.shape[1])], axis=1)
shifted

     A    B    C    D    E
0  5.3  NaN  NaN  NaN  NaN
1  5.3  5.1  NaN  NaN  NaN
2  4.3  4.1  3.5  NaN  NaN
3  4.3  4.1  3.5  4.2  NaN
4  4.3  4.1  3.5  4.2  4.5

Then you can get your mean.

# Change this  to get the last n number of rows
shifted.iloc[-3:].mean()

A    4.3
B    4.1
C    3.5
D    4.2
E    4.5
dtype: float64

Or the rolling mean

#   Change this  to get the last n number of rows
shifted.rolling(3, min_periods=1).mean()

          A         B    C    D    E
0  5.300000       NaN  NaN  NaN  NaN
1  5.300000  5.100000  NaN  NaN  NaN
2  4.966667  4.600000  3.5  NaN  NaN
3  4.633333  4.433333  3.5  4.2  NaN
4  4.300000  4.100000  3.5  4.2  4.5

Numpy strides

I'll use strides to construct a 3-D array and average over one of the axes. This is faster but confusing as all ...

Also, I wouldn't use this. I just wanted to nail down how to grab diagonal elements via strides. This was more practice for me and I wanted to share.

from numpy.lib.stride_tricks import as_strided as strided

a = df.values

roll = 3
r_ = roll - 1  # one less than roll

h, w = a.shape
w_ = w - 1  # one less than width

b = np.empty((h + 2 * w_ + r_, w), dtype=a.dtype)
b.fill(np.nan)
b[w_ + r_:-w_] = a

s0, s1 = b.strides
a_ = np.nanmean(strided(b, (h + w_, roll, w), (s0, s0, s1 - s0))[w_:], axis=1)

pd.DataFrame(a_, df.index, df.columns)

          A         B    C    D    E
0  5.300000       NaN  NaN  NaN  NaN
1  5.300000  5.100000  NaN  NaN  NaN
2  4.966667  4.600000  3.5  NaN  NaN
3  4.633333  4.433333  3.5  4.2  NaN
4  4.300000  4.100000  3.5  4.2  4.5

Numba

I feel better about this than I do using strides

import numpy as np
from numba import njit
import warnings

@njit
def dshift(a, roll):
  h, w = a.shape
  b = np.empty((h, roll, w), dtype=np.float64)
  b.fill(np.nan)

  for r in range(roll):
    for i in range(h):
      for j in range(w):
        k = i - j - r
        if k >= 0:
          b[i, r, j] = a[k, j]

  return b

with warnings.catch_warnings():
  warnings.simplefilter('ignore', category=RuntimeWarning)

  df_ = pd.DataFrame(np.nanmean(dshift(a, 3), axis=1, ), df.index, df.columns)

df_

          A         B    C    D    E
0  5.300000       NaN  NaN  NaN  NaN
1  5.300000  5.100000  NaN  NaN  NaN
2  4.966667  4.600000  3.5  NaN  NaN
3  4.633333  4.433333  3.5  4.2  NaN
4  4.300000  4.100000  3.5  4.2  4.5
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • 1
    Thanks a lot, I only shifted in the beginning because the end customer of this has a specific format for the result and totally missed the fact that it's easier to calculate the rolling mean row-wise than column-wise – AFVM Feb 25 '19 at 16:44
  • 1
    `justify(a,invalid_val=np.nan,axis=0,side='down')` from https://stackoverflow.com/a/44559180/? Again, I don't think I have understood the question completely, so I am not too sure if this solves it. – Divakar Feb 25 '19 at 17:29
  • I remember that (-: – piRSquared Feb 25 '19 at 17:30
  • 1
    Replacement for the first method : `strided_indexing_roll(df.T, np.arange(df.shape[1]))` from https://stackoverflow.com/a/51614167/. Also related - https://stackoverflow.com/a/51613442/. – Divakar Feb 25 '19 at 18:54