I am trying to write some code to preprocess data for an autoregressive algorithm. To do so, I am adding new columns to the DataFrame I use for the learning process (these new columns contain earlier values of the output). After quite a bit of struggling, I arrived at the following code:

for i in range(0, n):
    tmpOutput = pd.Series(output.ix[i:len(output.index)-n+i, 1])
    tmpOutput.index = range(n, len(output.index) + 1)
    tmpOutput.name = 'T-' + str(n-i)
    tmp = tmp.join([tmpOutput])

As you can see, I first extract some data and build a Series from it; I then modify the index and rename the series (to avoid a naming conflict in my loop), and finally I perform a join. I was wondering whether this code can be improved, or whether there is an alternative approach with better performance.

smci
  • Unless you have something specific in mind you'd like to ask about, there is a good chance that your question will be summarily executed. – Mad Physicist Dec 12 '16 at 09:46
  • Instead of voting to close, you could vote to migrate to CodeReview.SE, where it's definitely on-topic. – smci Dec 12 '16 at 09:49
  • @PeioBourreau, you have to **state a clear question**, and **show us code using reproducible data** (e.g. declare `output`, using a random seed; what are its dimensions and data type?). This code is unclear, irreproducible and won't execute standalone: we don't have your dataframe `output`, and `self.order` is a reference to some class whose code we don't have. Please fix all of those. And if the issue is performance (runtime), then show the current runtimes using timeit. – smci Dec 12 '16 at 09:51

2 Answers

It looks like the intent of your code is to create n lagged versions of the series output, although without seeing the code that consumes it (please show it, or at least the formula) we're working in the dark. I imagine you can avoid creating and storing n lagged copies of the same series at all, so don't build this matrix in the first place! Even if you really do need to create and append the columns, there's no need to do it iteratively in a for-loop, which will be pretty slow. Avoid for-loops in general, and especially for-loops that join on an entire series at each iteration. There's also no need to create the index by hand.

For a lag function, see pandas.DataFrame.shift() or pandas.Series.shift() or also the rolling functions.
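
As a quick illustration of what shift() does, here is a minimal sketch on a made-up five-element series (the values and the column labels are only for illustration, not taken from the question):

```python
import pandas as pd

# Toy series; the values are made up for illustration.
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# shift(k) moves values k rows down, so row t holds the value from row t-k.
# The index is left untouched and the first k rows become NaN.
lag1 = s.shift(1)
lag2 = s.shift(2)

print(pd.DataFrame({'T-0': s, 'T-1': lag1, 'T-2': lag2}))
```

Because the index is preserved, the lagged series line up with the original one without any manual re-indexing or join.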

Also see e.g. Andy Hayden's answer to "How to create a lagged data structure using pandas dataframe"

python: shift column in pandas dataframe up by one

smci
  • Thanks a lot, that is really informative. I apologize as my initial question/code was unclear and did not include some data to help you all. – PeioBourreau Dec 13 '16 at 14:47

Following the hints from smci, here is a solution using the pandas concat and shift functions:

import numpy as np
import pandas as pd

np.random.seed(1)

ts = pd.Series(np.random.rand(100))

max_lag = 5

# shift(lag) pulls in the value from `lag` rows earlier, so label the columns T - lag
column_names = ['T - %d' % lag for lag in range(max_lag + 1)]

df = pd.concat([ts.shift(lag) for lag in range(max_lag + 1)], keys=column_names, axis=1)

df.head(10)
Out[5]: 
      T - 0     T - 1     T - 2     T - 3     T - 4     T - 5
0  0.417022       NaN       NaN       NaN       NaN       NaN
1  0.720324  0.417022       NaN       NaN       NaN       NaN
2  0.000114  0.720324  0.417022       NaN       NaN       NaN
3  0.302333  0.000114  0.720324  0.417022       NaN       NaN
4  0.146756  0.302333  0.000114  0.720324  0.417022       NaN
5  0.092339  0.146756  0.302333  0.000114  0.720324  0.417022
6  0.186260  0.092339  0.146756  0.302333  0.000114  0.720324
7  0.345561  0.186260  0.092339  0.146756  0.302333  0.000114
8  0.396767  0.345561  0.186260  0.092339  0.146756  0.302333
9  0.538817  0.396767  0.345561  0.186260  0.092339  0.146756
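
If the lagged frame is then fed to a regression-style autoregressive fit, the leading rows that contain NaN usually have to be dropped first. The following is only a sketch under that assumption; the names `lagged`, `complete`, `X` and `y` are illustrative and the integer column labels are a choice made here for brevity:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
ts = pd.Series(np.random.rand(100))
max_lag = 5

# Lag 0 is the target; lags 1..max_lag are the predictors.
# Passing a dict to concat uses its keys as the column labels.
lagged = pd.concat(
    {lag: ts.shift(lag) for lag in range(max_lag + 1)}, axis=1
)

# The first max_lag rows contain at least one NaN; drop them.
complete = lagged.dropna()

y = complete[0]
X = complete.drop(columns=0)

print(X.shape, y.shape)
```

Each remaining row of `X` holds the `max_lag` values that precede the corresponding entry of `y`.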
FLab
  • I think the most important thing is to tell the OP **to not do this**, it's a totally unnecessary matrix, since they want improved performance. It's not scalable either, it'll eat memory as well as CPU. Just use the `rolling` functions to pass lagged versions of the output to the Autoregressive algorithm. – smci Dec 13 '16 at 00:56
  • I get your point; still, this version is far more efficient than the original one suggested by the OP, so I think it is worth flagging it as an easy solution. Without further info we can't know to what degree performance is essential to him. – FLab Dec 13 '16 at 08:56
  • not creating the unnecessary matrix should always be more performant than creating the unnecessary matrix :) – smci Dec 16 '16 at 07:21
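
One possible reading of the "don't build the matrix" advice, sketched here with NumPy rather than the pandas rolling functions mentioned in the comments (which aggregate over windows instead of handing back lag columns): `numpy.lib.stride_tricks.sliding_window_view`, available from NumPy 1.20, exposes the lagged windows as a view onto the original array, so no lagged copies are stored. This is a swapped-in technique and an assumption about what would suit the OP; the variable names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(1)
values = rng.random(100)
max_lag = 5

# Each row is a window of max_lag + 1 consecutive values, oldest first.
# This is a read-only view onto `values`; no lagged copies are stored.
windows = sliding_window_view(values, max_lag + 1)

# Last column is the current value, the others are its predecessors.
y = windows[:, -1]
X = windows[:, :-1]

print(windows.shape, np.shares_memory(windows, values))
```

Whether this actually saves memory end to end depends on the consumer: an estimator that copies its input will materialise the matrix anyway.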