I have a table like this:

import pandas as pd
values = [0,0,0,2000,0,0,700,0,0,530,1000,820,0,0,200]
durations = [0,0,0,12,0,0,8,0,0,2,5,15,0,0,3]

ex = pd.DataFrame({'col_to_roll' : values, 'max_duration': durations})

    col_to_roll  max_duration
0             0             0
1             0             0
2             0             0
3          2000            12
4             0             0
5             0             0
6           700             8
7             0             0
8             0             0
9           530             2
10         1000             5
11          820            15
12            0             0
13            0             0
14          200             3

For each row position i, I want to compute a rolling sum of col_to_roll between indexes i-7 and i-4 (both included). The caveat is that values "further in the past" should be counted more times, depending on the column max_duration (which tells for how many timesteps into the future that value can still have an effect).
There is also an upper bound: the number of remaining timesteps to be counted (min 1, max 4). So if I'm on row 7 computing the rolled sum, the value on row 0 is counted min(max_duration[0], 4) times, the value on row 1 is counted min(max_duration[1], 3) times, etc.
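To make the weighting rule concrete, here is a small worked sketch in plain Python for a single target row. It assumes 0-based row positions and reuses the example data; the names `terms`, `remaining` and `weight` are only illustrative.

```python
values = [0, 0, 0, 2000, 0, 0, 700, 0, 0, 530, 1000, 820, 0, 0, 200]
durations = [0, 0, 0, 12, 0, 0, 8, 0, 0, 2, 5, 15, 0, 0, 3]

i = 13                                  # target row
terms = []
for j in range(i - 7, i - 3):           # source rows 6, 7, 8, 9
    remaining = i - j - 3               # cap on the weight: 4, 3, 2, 1
    weight = min(durations[j], remaining)
    terms.append(values[j] * weight)

total = sum(terms)
print(terms, total)
```

Row 6 contributes 700 * min(8, 4) and row 9 contributes 530 * min(2, 1); the other two rows hold zeros.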

I could do it the brute-force way:

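import numpy as np

new_col = []
for i in range(7, len(ex)):
    # Weight each value by its max_duration, capped at the remaining timesteps (4, 3, 2, 1)
    rolled_val = sum(ex.col_to_roll.iloc[j] * min(ex.max_duration.iloc[j], i - j + 1 - 4)
                     for j in range(i - 7, i - 3))
    new_col.append(rolled_val)
ex['rolled_col'] = [np.nan] * 7 + new_col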

This yields the following result for the example above:

        col_to_roll  max_duration  rolled_col
    0             0             0         NaN
    1             0             0         NaN
    2             0             0         NaN
    3          2000            12         NaN
    4             0             0         NaN
    5             0             0         NaN
    6           700             8         NaN
    7             0             0      2000.0
    8             0             0      4000.0
    9           530             2      6000.0
    10         1000             5      8700.0
    11          820            15      1400.0
    12            0             0      2100.0
    13            0             0      3330.0
    14          200             3      2060.0

That being said, I'd appreciate a more elegant (and more importantly, more efficient) way to get this result with some pandas magic.

Rik Kraan
mlx

3 Answers


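Just to share my idea: this can be solved with numpy, without a row-by-row accumulation loop, by gathering every window into a 2D array and doing the multiplication and sum in one vectorized step.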

import numpy as np

ex_len = ex.shape[0]
# Row i uses the window i-7 .. i-4, so build one row of source indices per target row
inds = np.vstack([range(i-7, i-3) for i in range(7, ex_len)])
# part one: the values to sum, gathered per window
col_to_roll = np.take(ex.col_to_roll.values, inds)
# part two: the weights, i.e. max_duration capped at the remaining timesteps (4, 3, 2, 1)
max_duration = np.take(ex.max_duration.values, inds)
duration_to_compare = np.array([[i-j+1-4 for j in range(i-7, i-3)] for i in range(7, ex_len)])
weights = np.minimum(max_duration, duration_to_compare)

new_col = np.sum(col_to_roll * weights, axis=1)
ex['rolled_col'] = np.concatenate(([np.nan]*7, new_col))
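As a possible variation on the same idea, the index-building list comprehensions could be replaced by `numpy.lib.stride_tricks.sliding_window_view`. This is only a sketch, assuming NumPy 1.20 or later and at least 7 rows; `window`, `lag`, `caps` and `rolled` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

values = [0, 0, 0, 2000, 0, 0, 700, 0, 0, 530, 1000, 820, 0, 0, 200]
durations = [0, 0, 0, 12, 0, 0, 8, 0, 0, 2, 5, 15, 0, 0, 3]
ex = pd.DataFrame({'col_to_roll': values, 'max_duration': durations})

window = 4                          # rows i-7 .. i-4
lag = 4                             # gap between the window's last row and row i
caps = np.arange(window, 0, -1)     # [4, 3, 2, 1], oldest row first

# Each row k of these views holds source rows k .. k+3
vals = sliding_window_view(ex['col_to_roll'].to_numpy(), window)
durs = sliding_window_view(ex['max_duration'].to_numpy(), window)
sums = (vals * np.minimum(durs, caps)).sum(axis=1)

# The window starting at row k belongs to target row k + 7
n = len(ex)
offset = window - 1 + lag
rolled = np.full(n, np.nan)
rolled[offset:] = sums[:n - offset]
ex['rolled_col'] = rolled
```

The views do not copy the underlying data, so the only sizeable temporaries are the weight and product arrays.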
meTchaikovsky

Here is my humble idea for an elegant and efficient way to do this. To avoid reinventing the wheel, let's install pandarallel by invoking pip install pandarallel. I am a fan of multiprocessing, and it should help with larger data.

from operator import mul

import pandas as pd
import numpy as np
from pandarallel import pandarallel

def rocknroll(index):
    if index>=7:
        a = ex['col_to_roll'].iloc[index-7:index-3]
        b = map(min, ex['max_duration'].iloc[index-7:index-3], [4,3,2,1])
        return sum(map(mul, a, b))
    else:
        return np.nan
    
pandarallel.initialize()
    
values = [0,0,0,2000,0,0,700,0,0,530,1000,820,0,0,200]
durations = [0,0,0,12,0,0,8,0,0,2,5,15,0,0,3]

ex = pd.DataFrame({'col_to_roll' : values, 'max_duration': durations})
ex['index_copy'] = list(range(0, len(ex)))
ex['rolled_col'] = ex['index_copy'].parallel_apply(rocknroll)
ex.drop(columns=['index_copy'], inplace=True)
print(ex)

Output:

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
    col_to_roll  max_duration  rolled_col
0             0             0         NaN
1             0             0         NaN
2             0             0         NaN
3          2000            12         NaN
4             0             0         NaN
5             0             0         NaN
6           700             8         NaN
7             0             0      2000.0
8             0             0      4000.0
9           530             2      6000.0
10         1000             5      8700.0
11          820            15      1400.0
12            0             0      2100.0
13            0             0      3330.0
14          200             3      2060.0

Further information about element-wise operations on lists can be found in the question "Element-wise addition of 2 lists?".

Sandi

You can use DataFrame.rolling() to create rolling windows, in combination with apply, to calculate the rolled_col sum for the specified rolling windows.

First calculate the window size using the lower & upper bound (and add 1 to include both indices). This enables you to play around with different time intervals.

lower_bound = -7
upper_bound = -4
window_size = upper_bound - lower_bound + 1

Second, define the function to apply to each rolling window. In your case, that means taking the product of col_to_roll and the element-wise minimum of max_duration and the list [4, 3, 2, 1], then summing all values in the sliding window.

def calculate_rolled_count(series, ex):
    index = series.index
    min_values = np.minimum(ex.loc[index, 'max_duration'].values, list(range(4, 0, -1)))
    return np.sum(ex.loc[index, 'col_to_roll'] * min_values)
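To see what this function computes for one window, it can be called by hand. The sketch below repeats the example data and the function so that it runs on its own, and picks rows 3 to 6, which is the window feeding row 10; the variable names `window` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

values = [0, 0, 0, 2000, 0, 0, 700, 0, 0, 530, 1000, 820, 0, 0, 200]
durations = [0, 0, 0, 12, 0, 0, 8, 0, 0, 2, 5, 15, 0, 0, 3]
ex = pd.DataFrame({'col_to_roll': values, 'max_duration': durations})

def calculate_rolled_count(series, ex):
    index = series.index
    min_values = np.minimum(ex.loc[index, 'max_duration'].values, list(range(4, 0, -1)))
    return np.sum(ex.loc[index, 'col_to_roll'] * min_values)

# .loc slicing is inclusive on both ends, so this selects rows 3, 4, 5, 6
window = ex['col_to_roll'].loc[3:6]
result = calculate_rolled_count(window, ex)
print(result)  # 2000 * min(12, 4) + 700 * min(8, 1) = 8700
```

Only the index of the series passed in is used; the actual values are looked up in ex, which is why the same function works for windows from either column.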

Finally, assign a new column rolled_col to your original dataframe and apply the defined function over all rolling windows. We have to shift the result to make each value correspond to the desired row (by default, the rolling window assigns its value to the right bound of the window).

ex.assign(rolled_col=lambda x: x.rolling(window_size)
                                .apply(lambda x: calculate_rolled_count(x, ex))
                                .shift(-upper_bound)['max_duration'])

Result

    col_to_roll     max_duration    rolled_col
0   0               0               NaN
1   0               0               NaN
2   0               0               NaN
3   2000            12              NaN
4   0               0               NaN
5   0               0               NaN
6   700             8               NaN
7   0               0               2000.0
8   0               0               4000.0
9   530             2               6000.0
10  1000            5               8700.0
11  820             15              1400.0
12  0               0               2100.0
13  0               0               3330.0
14  200             3               2060.0
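The alignment step (rolling result labelled at the right edge, then shifted) can be checked on a toy series. This is a minimal sketch using a plain sum instead of the weighted function, with a counter series made up for illustration.

```python
import pandas as pd

s = pd.Series(range(15))
window_size = 4
upper_bound = -4

# rolling(4).sum() at row r covers rows r-3 .. r (labelled at the right edge)
right_aligned = s.rolling(window_size).sum()
# Shifting by 4 moves that value to row r+4, so row i now holds rows i-7 .. i-4
shifted = right_aligned.shift(-upper_bound)

print(shifted[7])   # 0 + 1 + 2 + 3 = 6.0
print(shifted[10])  # 3 + 4 + 5 + 6 = 18.0
```

The first seven rows stay NaN: three from the incomplete windows and four more from the shift.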
Rik Kraan