Improve performance of iterative join code which creates lagged versions of DataFrame series

Question

I am trying to write some code to preprocess data for an autoregressive algorithm. To do so, I am adding new columns to the DataFrame I use for the learning process (these new columns contain earlier values of the output). After struggling quite a lot, I ended up with the following code:

for i in range(0, n):
    tmpOutput = pd.Series(output.ix[i:len(output.index)-n+i, 1])
    tmpOutput.index = range(n, len(output.index) + 1)
    tmpOutput.name = 'T-' + str(n-i)
    tmp = tmp.join([tmpOutput])

As you can see, I first extract some data and build a Series from it; I then modify the index and rename the series (to avoid naming conflicts in my loop), and finally I perform a join. I was wondering whether this code can be improved, or whether there is an alternative approach with better performance.

Unless you have something specific in mind you'd like to ask about, there is a good chance that your question will be summarily executed. — Mad Physicist, Dec 12 '16 at 09:46
Instead of voting to close, you could vote to migrate to CodeReview.SE, where it's definitely on-topic. — smci, Dec 12 '16 at 09:49
@PeioBourreau, you have to **state a clear question**, and **show us code using reproducible data** (e.g. declare `output`, using a random-seed. What are its dimensions? data-type?). This code is unclear, irreproducible and won't execute standalone - we don't have your dataframe `output`, and `self.order` is a reference to some class whose code we don't have. Please fix all those. And if the issue is performance (runtime), then show the current runtimes using timeit. — smci, Dec 12 '16 at 09:51

score 0 · Answer 1 · edited May 23 '17 at 12:13

It looks like the intent of your code is to create n lagged versions of the series output. Without seeing the code that consumes them (please show it, or at least the formula), we're working in the dark, but I imagine you can avoid creating and storing n lagged copies of the same series at all. So don't create this matrix in the first place! Even if you really do need to create and append the copies, there's no need to do it iteratively with a for-loop, which will be pretty slow: every join on the entire series builds a new DataFrame. Avoid for-loops, and especially for-loops containing a join on an entire series. There's also no need to create the index manually.

For a lag function, see pandas.DataFrame.shift() or pandas.Series.shift() or also the rolling functions.

Also see e.g. Andy Hayden's answer to "How to create a lagged data structure using pandas dataframe" and the related question "python: shift column in pandas dataframe up by one".
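As a minimal sketch of what these two functions do (the toy series `s` and the window size are made up for illustration, not taken from the question), `shift` aligns earlier values with the current row, while `rolling` computes a statistic over a trailing window without storing the lagged copies:

```python
import pandas as pd

# Hypothetical toy data, only to show the semantics.
s = pd.Series([10.0, 20.0, 30.0, 40.0])

# shift(1) moves every value down one row: row t now holds the value from t-1.
# The index is kept as-is and the first row becomes NaN.
lag1 = s.shift(1)

# rolling(window=2) looks at the current and the previous value;
# here we take their mean. The first row is NaN (window not yet full).
roll_mean = s.rolling(window=2).mean()

print(pd.DataFrame({"value": s, "lag_1": lag1, "roll_mean_2": roll_mean}))
```

Neither call needs a manually built index or a join; alignment comes from the existing index.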

Thanks a lot, that is really informative. I apologize as my initial question/code was unclear and did not include some data to help you all. — PeioBourreau, Dec 13 '16 at 14:47

score 0 · Answer 2 · answered Dec 12 '16 at 12:36

Implementing hints by smci, here is a solution using pandas concat and shift functions:

import numpy as np
import pandas as pd

np.random.seed(1)

ts = pd.Series(np.random.rand(100))

max_lag = 5

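column_names = ['T - %d' % lag for lag in range(max_lag + 1)]

# shift(lag) moves values down by `lag` rows, so column 'T - k' holds the value from k steps earlier
df = pd.concat([ts.shift(lag) for lag in range(max_lag + 1)], keys=column_names, axis=1)

df.head(10)
Out[5]: 
      T - 0     T - 1     T - 2     T - 3     T - 4     T - 5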
0  0.417022       NaN       NaN       NaN       NaN       NaN
1  0.720324  0.417022       NaN       NaN       NaN       NaN
2  0.000114  0.720324  0.417022       NaN       NaN       NaN
3  0.302333  0.000114  0.720324  0.417022       NaN       NaN
4  0.146756  0.302333  0.000114  0.720324  0.417022       NaN
5  0.092339  0.146756  0.302333  0.000114  0.720324  0.417022
6  0.186260  0.092339  0.146756  0.302333  0.000114  0.720324
7  0.345561  0.186260  0.092339  0.146756  0.302333  0.000114
8  0.396767  0.345561  0.186260  0.092339  0.146756  0.302333
9  0.538817  0.396767  0.345561  0.186260  0.092339  0.146756
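The first `max_lag` rows contain NaN because no earlier values exist for them. The question does not show how the lagged frame is consumed, so the following is only a sketch under assumptions: it drops the incomplete rows and feeds the rest to an ordinary least-squares fit via `np.linalg.lstsq`, which stands in for whatever autoregressive algorithm is actually used. The names `lag_k`, `target`, `X`, `y` and `coef` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
ts = pd.Series(np.random.rand(100))
max_lag = 5

# One column per lag, built in a single concat (dict keys become column names).
lagged = pd.concat({f"lag_{k}": ts.shift(k) for k in range(1, max_lag + 1)}, axis=1)

# Attach the current value as the target and drop rows with missing lags.
frame = lagged.assign(target=ts).dropna()

X = frame.drop(columns="target").to_numpy()   # shape (95, 5)
y = frame["target"].to_numpy()                # shape (95,)

# Assumed stand-in model: least-squares AR fit with an intercept column.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

With 100 observations and 5 lags, 95 complete rows remain; the first usable row is index 5.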

I think the most important thing is to tell the OP **to not do this**, it's a totally unnecessary matrix, since they want improved performance. It's not scalable either, it'll eat memory as well as CPU. Just use the `rolling` functions to pass lagged versions of the output to the Autoregressive algorithm. — smci, Dec 13 '16 at 00:56
I get your point, but this version is still far more efficient than the original one suggested by the OP, so I think it is worth flagging as an easy solution. Without further info we can't know to what degree performance is essential to him. — FLab, Dec 13 '16 at 08:56
not creating the unnecessary matrix should always be more performant than creating the unnecessary matrix :) — smci, Dec 16 '16 at 07:21
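One way to get lagged windows without materializing the matrix is sketched below. Note that this swaps in NumPy's `sliding_window_view` (available from NumPy 1.20) rather than the pandas `rolling` functions mentioned in the comments, and it assumes the consumer accepts NumPy arrays; the names `windows`, `X` and `y` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

np.random.seed(1)
values = np.random.rand(100)
max_lag = 5

# Read-only view of shape (95, 6): row i is values[i : i + 6].
# No data is copied; the view reuses the memory of `values`.
windows = sliding_window_view(values, max_lag + 1)

# The last column is the current value; the preceding columns are the lags.
y = windows[:, -1]
# Reverse the remaining columns so that column k-1 holds lag k (still a view).
X = windows[:, :-1][:, ::-1]

print(windows.shape, np.shares_memory(windows, values))
```

Slicing keeps everything as views, but an operation that needs a contiguous array may still make a copy at that point.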
