12

I have a situation where I have a dataframe row to perform calculations with, and I need to use values in following (potentially preceding) rows to do these calculations (essentially a perfect forecast based on the real data set). I get each row from an earlier df.apply call, so I could pass the whole df along to the downstream objects, but that seems less than ideal based on the complexity of objects in my analysis.

I found one closely related question and answer [1], but the problem is actually fundamentally different in the sense that I do not need the whole df for my calcs, simply the following x number of rows (which might matter for large dfs).

So, for example:

df = pd.DataFrame([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], 
                  columns=['PRICE'])
horizon = 3

I need to access values in the following 3 (horizon) rows in my row-wise df.apply call. How can I get a naive forecast of the next 3 data points dynamically in my row-wise apply calcs? e.g. for row the first row, where the PRICE is 100, I need to use [200, 300, 400] as a forecast in my calcs.

[1] apply a function to a pandas Dataframe whose returned value is based on other rows

Community
  • 1
  • 1
lukewitmer
  • 1,153
  • 3
  • 11
  • 21

2 Answers2

10

By getting the row's index inside of the df.apply() call using row.name, you can generate the 'forecast' data relative to which row you are currently on. This is effectively a preprocessing step to put the 'forecast' onto the relevant row, or it could be done as part of the initial df.apply() call if the df is available downstream.

df = pd.DataFrame(
    [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    columns=["PRICE"]
)
horizon = 3
    
df["FORECAST"] = df.apply(
    lambda x: [df["PRICE"][x.name + 1 : x.name + horizon + 1]],
    axis=1
)

Results in this:

   PRICE          FORECAST
0    100   [200, 300, 400]
1    200   [300, 400, 500]
2    300   [400, 500, 600]
3    400   [500, 600, 700]
4    500   [600, 700, 800]
5    600   [700, 800, 900]
6    700  [800, 900, 1000]
7    800       [900, 1000]
8    900            [1000]
9   1000                []

Which can be used in your row-wise df.apply() calcs.

EDIT: If you want to strip the index from the resulting 'Forecast':

df["FORECAST"] = df.apply(
    lambda x: [df["PRICE"][x.name + 1 : x.name + horizon + 1].reset_index(drop=True)],
    axis=1
)
Paul P
  • 3,346
  • 2
  • 12
  • 26
lukewitmer
  • 1,153
  • 3
  • 11
  • 21
2

You may find this useful as well.

keys = range(horizon + 1)
pd.concat([df.shift(-i) for i in keys], axis=1, keys=keys)

      0       1       2       3
  PRICE   PRICE   PRICE   PRICE
0   100   200.0   300.0   400.0
1   200   300.0   400.0   500.0
2   300   400.0   500.0   600.0
3   400   500.0   600.0   700.0
4   500   600.0   700.0   800.0
5   600   700.0   800.0   900.0
6   700   800.0   900.0  1000.0
7   800   900.0  1000.0     NaN
8   900  1000.0     NaN     NaN
9  1000     NaN     NaN     NaN

if you assign the concat to df_c

keys = range(horizon + 1)
df_c = pd.concat([df.shift(-i) for i in keys], axis=1, keys=keys)

df_c.apply(lambda x: pd.Series([x[0].values, x[1:].values]), axis=1)

          0                       1
0   [100.0]   [200.0, 300.0, 400.0]
1   [200.0]   [300.0, 400.0, 500.0]
2   [300.0]   [400.0, 500.0, 600.0]
3   [400.0]   [500.0, 600.0, 700.0]
4   [500.0]   [600.0, 700.0, 800.0]
5   [600.0]   [700.0, 800.0, 900.0]
6   [700.0]  [800.0, 900.0, 1000.0]
7   [800.0]    [900.0, 1000.0, nan]
8   [900.0]      [1000.0, nan, nan]
9  [1000.0]         [nan, nan, nan]
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Do you think there is a performance benefit one way or the other, and under what circumstances? – lukewitmer May 11 '16 at 13:58
  • 1
    @lukewitmer No, I just ran `%%timeit` and your wat is 5+ times quicker. I like my way because I find it more elegent. But I'd go with you way because its plenty intuitive and its quicker. – piRSquared May 11 '16 at 14:49