
I'm trying to compute a recursive equation over the rows of a dataframe, with some arguments supplied by other dataframes. The equation is given below and should be applied to each column of the matrix. It looks like an exponential moving average, except that the decay is not constant but is taken from another dataframe.

$$M_n = \alpha_n P_n + (1 - \alpha_n)\, M_{n-1}$$

Given:

  • a matrix Alpha of the same size as the output
  • a matrix P of the same size as the output
  • a vector M0 of the same width as the output

I did a first try with a double loop (with .iloc):

import pandas as pd
import numpy as np

"""
Assuming inputs:
    - Matrix P of size 1000x4
    - Matrix alpha of size 1000x4
    - Vector M0 of size 1X4
"""

# input variables
height = 1000
width = 4
np.random.seed(1)
P = pd.DataFrame(np.random.normal(loc=170, scale=12, size=(height, width)), index=range(height), columns=range(width))
np.random.seed(1)
alpha = pd.DataFrame(np.random.normal(loc=0.04, scale=0.04, size=(height, width)), index=range(height), columns=range(width))
np.random.seed(1)
M0 = pd.Series(np.random.normal(loc=170, scale=12, size=width), index=range(width))  # one initial value per column


# Output table
MA = P.copy()*0
MA.iloc[0] = M0 

# Recursive equation
for x in range(width):
    for y in range(1, height):
        # single .iloc[row, col] indexing so the assignment writes into MA itself
        MA.iloc[y, x] = alpha.iloc[y, x]*P.iloc[y, x] + (1 - alpha.iloc[y, x])*MA.iloc[y-1, x]

and a second try with vectorization, by expanding the problem into a cumulative product (see equation below), but I failed to retrieve the expected values (code will be updated later):

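$$M_n = \alpha_n P_n + \sum_{i=1}^{n-1} \alpha_i P_i \prod_{j=i+1}^{n} (1 - \alpha_j) + M_0 \prod_{j=1}^{n} (1 - \alpha_j)$$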

I could rework my math, but I was wondering whether there is a simpler or more efficient way to do this, as the loop takes a while.

Thank you for any help!


Update 1: A few comments:

  • My original dataframe is a price matrix for different assets (columns); rows are days in ascending order downwards (past at the top, present at the bottom).
  • The row where the moving average starts is given by a function of the asset, which returns the initial window. The algorithm is therefore not symmetric across columns.

My strategy is to loop over the columns, extract the desired vectors, perform the calculation in numpy, and put the result back in a dataframe:

Recursive way: I rewrote my code as:

ema = P.copy()*0

for x in ema.columns:

    # define which row to start the algorithm at
    # (windows is assumed to hold the initial window for each asset)
    start = max(100, 250, int(windows[x]))

    # store the index (dates) to re-inject after the numpy calculation
    i_d = P.iloc[start:].index

    # extract the corresponding vectors from the original matrices
    alpha_temp = alpha.iloc[start:][x].values
    p_temp = P.iloc[start:][x].values
    ema_temp = ema.iloc[start:][x].values.copy()

    # M0
    ema_temp[0] = M0[x]

    # recursive equation
    for y in range(1, len(ema_temp)):
        ema_temp[y] = alpha_temp[y]*p_temp[y] + (1-alpha_temp[y])*ema_temp[y-1]

    # transform into a dataframe and re-inject into the dataframe ema
    ema_temp = pd.DataFrame(ema_temp, index=i_d, columns=[x])
    ema.update(ema_temp)

Expansion of the equation

Thank you a_guest for your help.

# This is the product within the summation.
prod = np.flipud(np.cumprod(1 - np.flipud(alpha)))

# This is the sum over the scaled products.
sum_prod = np.cumsum((alpha * P)[:-1] * prod[1:])

# Combining all elements.
result = (alpha * P)[1:] + sum_prod + M0*prod[0]

I tried your code, but I could not get the right answer. I'm not sure I understand it 100%.

Assuming my data runs downwards, the first row would give:

(image: expression for the first row)

I don't understand how it can be reused for the second row, as it already includes 1-a_n everywhere.

Thanks a lot !

lucaschn
  • Provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including the code you have so far, to let others help you better – Mankind_008 Jul 18 '18 at 18:25
  • How can I provide a 3000*30 dataset? The algorithm needs hundreds of rows to run. I'm uploading my code very soon – lucaschn Jul 18 '18 at 19:11
  • You don't need to provide the whole data set, just some dummy data in the structure you are working with and the expected output from that dummy data. The same goes for the code: only the part you need simplified is required. Refer to the link in my previous comment. – Mankind_008 Jul 18 '18 at 19:49
  • Two things to note: firstly, what you are doing is not recursive but sequential in nature. Secondly, the calculations are independent across columns, irrespective of where you start the moving average. Also, a question regarding your post, if I understand your requirement correctly: you need a simple way to obtain the same result from your expansion of the original equation, right? – Mankind_008 Jul 19 '18 at 17:09
  • @Mankind_008 yes, I'd love to compute it in an easier/more efficient way. So far my updated code takes a few seconds to run; I believe the expanded equation is the right way, and it is also a good exercise to understand and improve my Python skills. Thanks if you can give a hand – lucaschn Jul 19 '18 at 17:59

2 Answers


I would recommend two modifications:

1. For simplification: since the columns are independent when calculating the moving averages, a single for loop iterating over rows will suffice. This also gives a minor performance boost.

for y in range(1,height):
    MA.iloc[y] = alpha.iloc[y]*P.iloc[y] + (1-alpha.iloc[y])*MA.iloc[y-1]

2. For computational efficiency/speed: indexing numpy ndarrays instead of pandas DataFrames/Series gives a considerable improvement in performance.

MA = MA.values                               # converted to ndarray from dataFrame
alpha = alpha.values                         # -do-
P = P.values                                 # -do-

for y in range(1,height):
    MA[y] = alpha[y]*P[y] + (1-alpha[y])*MA[y-1]
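
The row-wise loop above can be wrapped in a small helper that accepts DataFrames and returns a DataFrame with the original index and columns. This is only a minimal sketch: the name `ema_rowwise` and the dummy inputs are made up for illustration, and `M0` is assumed to be a vector with one value per column.

```python
import numpy as np
import pandas as pd

def ema_rowwise(P, alpha, M0):
    """Sequential update M[n] = alpha[n]*P[n] + (1 - alpha[n])*M[n-1], all columns at once."""
    p = P.to_numpy(dtype=float)
    a = alpha.to_numpy(dtype=float)
    out = np.empty_like(p)
    out[0] = np.asarray(M0, dtype=float)      # first row holds the initial values
    for n in range(1, len(p)):                # one pass over rows; columns are vectorized
        out[n] = a[n] * p[n] + (1 - a[n]) * out[n - 1]
    return pd.DataFrame(out, index=P.index, columns=P.columns)

# Small dummy inputs (made up for illustration)
rng = np.random.default_rng(0)
P = pd.DataFrame(rng.normal(170, 12, size=(50, 4)))
alpha = pd.DataFrame(rng.uniform(0.01, 0.1, size=(50, 4)))
M0 = P.iloc[0].to_numpy()
MA = ema_rowwise(P, alpha, M0)
```

Working on the underlying arrays and rebuilding the DataFrame once at the end avoids the per-element overhead of `.iloc` inside the loop.
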
Mankind_008

Your expansion of the recursion formula is the right approach, and you can use numpy tools to compute the various elements. Since the result for each column is an independent calculation, the algorithm can be set up for a single (1D) column. Extending it to multiple columns (2D) is straightforward: add the corresponding dimension and specify the axis keyword appropriately for each operation. So for the 1D case it is:

# This is the product within the summation.
prod = np.flipud(np.cumprod(1 - np.flipud(alpha)))

# This is the sum over the scaled products.
sum_prod = np.cumsum((alpha * P)[:-1] * prod[1:])

# Combining all elements.
result = (alpha * P)[1:] + sum_prod + M0*prod[0]

Note that the result is given for n > 1 (using your notation; n > 0 in Python's notation) but the remaining value for n = 1 (n = 0) can be computed straightforwardly since the sum is zero.
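
To see where each factor comes from, the recursion can be unrolled by hand for the first rows. The following is a worked sketch that assumes the update M_n = α_n P_n + (1 - α_n) M_{n-1} from the question's loop, with M_0 as the given initial value:

```latex
\begin{aligned}
M_1 &= \alpha_1 P_1 + (1-\alpha_1)\,M_0 \\
M_2 &= \alpha_2 P_2 + (1-\alpha_2)\,M_1
     = \alpha_2 P_2 + \alpha_1 P_1 (1-\alpha_2) + M_0 (1-\alpha_1)(1-\alpha_2) \\
M_n &= \alpha_n P_n + \sum_{i=1}^{n-1} \alpha_i P_i \prod_{j=i+1}^{n} (1-\alpha_j)
     + M_0 \prod_{j=1}^{n} (1-\alpha_j)
\end{aligned}
```

Every product ends at the current row n, so the factor (1 - α_n) appears in every term except α_n P_n. Moving to row n+1 multiplies the whole previous expression by (1 - α_{n+1}) and adds α_{n+1} P_{n+1}.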

Edit

The extension to 2D can be achieved by passing the dimension used for the computation to the axis keyword of each operation:

prod = np.flip(np.cumprod(1 - np.flip(alpha, axis=0), axis=0), axis=0)
sum_prod = np.cumsum((alpha * P)[:-1] * prod[1:], axis=0)
result = (alpha * P)[1:] + sum_prod + M0*prod[0]
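
One alternative way to vectorize the recursion is to divide by a forward cumulative product, so that each row's product stops at that row. This is not the code above but a rearrangement of the same expansion. The sketch below assumes that no `alpha` equals exactly 1 and that the series is short enough for the cumulative product not to underflow. The name `ema_vectorized` and the dummy inputs are hypothetical, and the sequential loop serves as the reference.

```python
import numpy as np

def ema_vectorized(P, alpha, M0):
    """Closed form of M[n] = alpha[n]*P[n] + (1 - alpha[n])*M[n-1] with M[0] = M0.

    P and alpha are (height,) or (height, width) arrays; row 0 of both is
    unused because M[0] is supplied directly. Assumes alpha != 1 everywhere.
    """
    P = np.asarray(P, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    decay = 1.0 - alpha                   # new array, the inputs are not modified
    decay[0] = 1.0                        # so that c[0] = 1
    c = np.cumprod(decay, axis=0)         # c[n] = prod_{j=1..n} (1 - alpha[j])
    terms = alpha * P / c                 # alpha[i] * P[i] / c[i]
    terms[0] = M0                         # row 0 carries the initial value
    return c * np.cumsum(terms, axis=0)   # M[n] = c[n] * (M0 + sum_{i=1..n} terms[i])

# Dummy data (made up for illustration)
rng = np.random.default_rng(1)
height, width = 200, 4
P = rng.normal(170, 12, size=(height, width))
alpha = rng.uniform(0.01, 0.1, size=(height, width))
M0 = P[0].copy()

# Reference: the sequential loop
ref = np.empty_like(P)
ref[0] = M0
for n in range(1, height):
    ref[n] = alpha[n] * P[n] + (1 - alpha[n]) * ref[n - 1]

vec = ema_vectorized(P, alpha, M0)
max_diff = np.abs(vec - ref).max()        # largest deviation from the loop result
```

The trade-off is numerical: `c` shrinks towards zero as rows accumulate, so for very long series or large decay values the division can lose precision or overflow, and the plain loop is the more robust choice.
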
a_guest
  • Thanks. Two questions: 1) How do I extend it to 2D? With df.apply? And how do I avoid applying it to the first row, with an if condition? 2) I have an additional constraint, as the initial M0 might not be on the same row for every column. I have a function returning the starting period. How can I include it? – lucaschn Jul 18 '18 at 21:34
  • @lucaschn See my updated answer. Your second point is not clear to me. Do you mean that `M0` is different for each _column_? This is already included in the above approach (it works similarly with a scalar or a `(width,)`-shaped vector due to broadcasting). If you really mean _row_, would that mean that for starting at row `k`, `M0[i] for i < k` is zero? Then you can include this as well by performing an additional product which brings the corresponding sum elements to zero. – a_guest Jul 18 '18 at 21:47
  • I could not find the expected answer with your code, so I went with the recursive way. However, I'm still interested in solving it this way as an intellectual challenge. Please see my post – lucaschn Jul 19 '18 at 13:46
  • @lucaschn My answer is based on the expansion of the recursion, which I verified to be correct for your first code example. In case `M0` starts at a different row index for each column, you can just ignore all results for `index < start_index` (or set them to zero). I don't understand your updated code example, also because some `beta_temp` terms appear which are not defined anywhere. – a_guest Jul 19 '18 at 14:21
  • I updated and commented the code to be more explicit. To be honest, I started Python 2 months ago and might not be used to the way you wrote your code. Could you maybe provide the whole function that iterates over the columns with your code, so that I can understand clearly how to implement it, please? I know it might be time-consuming, sorry – lucaschn Jul 19 '18 at 15:06
  • @lucaschn I stay with my above result (see the **Edit** section, which computes the result for _all_ columns "simultaneously"). I see that you have a different starting index for each column, but unless this is a performance issue you can just do the computation for _all_ rows and then mask the result via `result[start:, x]`. `numpy` arrays as well as `pandas` data frames represent tabular data, which means that when you have different starting indices per column, the remaining values need to be filled with something (e.g. `NaN`) in order to restore the 2D format. – a_guest Jul 19 '18 at 15:26
  • @lucaschn Okay, now I see, actually it's more complicated. Because the inner product depends on the outer index as well, you would need to introduce an additional dimension for that and limit the product per element (e.g. expansion to `(width, width)` shape for `prod` and adjust accordingly along one dimension so it stops at the diagonal; can use e.g. `np.triu` for that). Then take a full sum along that dimension. For covering multiple columns you then need a 3rd dimension. – a_guest Jul 19 '18 at 15:59
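
The per-column start rows discussed in the comments above can be handled by running the update column by column from each column's own start row and leaving the earlier rows as `NaN`. This is a minimal sketch under stated assumptions: `ema_with_starts` and the `starts` list are hypothetical and stand in for whatever function supplies the starting period, and the data is made up.

```python
import numpy as np

def ema_with_starts(P, alpha, M0, starts):
    """Run the update per column from its own start row; earlier rows stay NaN."""
    P = np.asarray(P, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    out = np.full(P.shape, np.nan)
    for col, start in enumerate(starts):
        out[start, col] = M0[col]                     # initial value at the start row
        for n in range(start + 1, P.shape[0]):
            out[n, col] = (alpha[n, col] * P[n, col]
                           + (1 - alpha[n, col]) * out[n - 1, col])
    return out

# Dummy data (made up for illustration)
rng = np.random.default_rng(2)
P = rng.normal(170, 12, size=(30, 3))
alpha = rng.uniform(0.01, 0.1, size=(30, 3))
M0 = np.array([170.0, 165.0, 180.0])
starts = [0, 5, 10]                                   # made-up start rows, one per column
result = ema_with_starts(P, alpha, M0, starts)
```

Filling the unused rows with `NaN` keeps the output rectangular, so it can be placed back into a DataFrame with the original index.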