57

Example

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print s 
1    5
2    4
3    3
4    2
5    1

Is there an efficient way to create a series. e.g. containing in each row the lagged values (in this example up to lag 2)

3    [3, 4, 5]
4    [2, 3, 4]
5    [1, 2, 3]

This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])

How can this be done in an efficient way for dataframes with a lot of timeseries which are very long?

Thanks

Edited after seeing the answers

ok, at the end I implemented this function:

def buildLaggedFeatures(s,lag=2,dropna=True):
'''
Builds a new DataFrame to facilitate regressing over all possible lagged features
'''
if type(s) is pd.DataFrame:
    new_dict={}
    for col_name in s:
        new_dict[col_name]=s[col_name]
        # create lagged Series
        for l in range(1,lag+1):
            new_dict['%s_lag%d' %(col_name,l)]=s[col_name].shift(l)
    res=pd.DataFrame(new_dict,index=s.index)

elif type(s) is pd.Series:
    the_range=range(lag+1)
    res=pd.concat([s.shift(i) for i in the_range],axis=1)
    res.columns=['lag_%d' %i for i in the_range]
else:
    print 'Only works for DataFrame or Series'
    return None
if dropna:
    return res.dropna()
else:
    return res 

it produces the wished outputs and manages the naming of columns in the resulting DataFrame.

For a Series as input:

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
   lag_0  lag_1  lag_2
1      5    NaN    NaN
2      4      5    NaN
3      3      4      5
4      2      3      4
5      1      2      3

and for a DataFrame as input:

s2=s=pd.DataFrame({'a':[5,4,3,2,1], 'b':[50,40,30,20,10]},index=[1,2,3,4,5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)

   a  a_lag1  a_lag2   b  b_lag1  b_lag2
3  3       4       5  30      40      50
4  2       3       4  20      30      40
5  1       2       3  10      20      30
JJJ
  • 1,009
  • 6
  • 19
  • 31
Mannaggia
  • 4,559
  • 12
  • 34
  • 47
  • 4
    Why would you want this? Consider looking at rolling_ functions (perhaps you could write your own). – Andy Hayden Dec 05 '13 at 21:00
  • 1
    did not find a rolling function suitable for what I need to do. I plan to build a (linear) regression model which predicts a variable (eg y) using a set of input features (e.g. x1, x2, x3) and also use lagged input features (eg x1 at t-1 and t-2, etc). So I try to just prepare the input data in a way which can then be used by different prediction models – Mannaggia Dec 05 '13 at 22:26
  • 1
    @AndyHayden An example is in building ML Decision Trees that have a continuous descriptive feature. In this case, you would need to order by the continuous descriptive feature and look at where the target feature column changes values, and computer the average of the continuous descriptive feature values between the previous row (requiring lag) and the current row at that point to determine the thresholds for how to split the dataset on that descriptive feature. – josiah Apr 04 '18 at 19:52

9 Answers9

70

As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.

One solution is to concat shifted Series together to make a DataFrame:

In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]: 
   0   1   2
1  5 NaN NaN
2  4   5 NaN
3  3   4   5
4  2   3   4
5  1   2   3

In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]: 
   0  1  2
3  3  4  5
4  2  3  4
5  1  2  3

Doing work on this will be more efficient that on lists...

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Do you know how to do this with DataFrames? Tried with s=s.astype(float) the_range=range(lag+1) res=pd.concat([s.shift(i) for i in the_range],axis=1) res.columns=['%s_lag_%d' %(j,i) for i in the_range for j in s.columns ]. There is a problem in renaming the columns "Reindexing only valid with uniquely valued Index objects" – Mannaggia Dec 06 '13 at 00:23
  • @Mannaggia It sounds like you are looking for a Panel... I think you should ask this a separate question. However, I think actually asking about the function you're applying later (in another question) might avoid the [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Andy Hayden Dec 06 '13 at 00:38
11

Very simple solution using pandas DataFrame:

number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

#if you want numpy arrays with no null values: 
df.dropna().values for numpy arrays

for Python 3.x (change xrange to range)

number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

print(df)

   vals  lag_1  lag_2  lag_3
0     5    NaN    NaN    NaN
1     4    5.0    NaN    NaN
2     3    4.0    5.0    NaN
3     2    3.0    4.0    5.0
4     1    2.0    3.0    4.0
smerllo
  • 3,117
  • 1
  • 22
  • 37
ansonw
  • 1,559
  • 1
  • 16
  • 22
10

For a dataframe df with the lag to be applied on 'col name', you can use the shift function.

df['lag1']=df['col name'].shift(1)
df['lag2']=df['col name'].shift(2)
5

I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.

Here's an example of the result:

# Setup
indx = pd.Index([1, 2, 3, 4, 5], name='time')
s=pd.Series(
    [5, 4, 3, 2, 1],
    index=indx,
    name='population')

shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])

Result: a MultiIndex DataFrame with two column labels: the original one ("population") and a new one ("lag"):

dataframe with lags 0, 1 and 2


Solution: Like in the accepted solution, we use DataFrame.shift and then pandas.concat.

def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag,
                                                      lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
                     .pipe(append_level_to_columns_of_dataframe,
                           lag, lag_label))

I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.

def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns

    new_level : scalar, or arraylike of length equal to the number of columns
    in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.

    name_of_new_level : str
        The label to give the new level.

    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given `level`
        appended to each column label.
    """
    old_columns = dataframe.columns

    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]

    if isinstance(dataframe.columns, pd.MultiIndex):
        new_columns = pd.MultiIndex.from_arrays(
            old_columns.levels + [new_level],
            names=(old_columns.names + [name_of_new_level]))
    elif isinstance(dataframe.columns, pd.Index):
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns] + [new_level],
            names=([old_columns.name] + [name_of_new_level]))

    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe

Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:

def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
        df.shift(lag)
        for lag in lags},
        axis=1)

Here's the result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]):

result of <code>shift_timeseries_by_lags_2</code>

Community
  • 1
  • 1
Charlie Brummitt
  • 651
  • 6
  • 11
5

Here is a cool one liner for lagged features with _lagN suffixes in column names using pd.concat:

lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag+1)) for lag in range(3)], axis=1).dropna()
mac13k
  • 2,423
  • 23
  • 34
4

You can do following:

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res = pd.DataFrame(index = s.index)
for l in range(3):
    res[l] = s.shift(l)
print res.ix[3:,:].as_matrix()

It produces:

array([[ 3.,  4.,  5.],
       [ 2.,  3.,  4.],
       [ 1.,  2.,  3.]])

which I hope is very close to what you are actually want.

lowtech
  • 2,492
  • 2
  • 22
  • 31
0

For multiple (many of them) lags, this could be more compact:

df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))
Björn Backgård
  • 148
  • 1
  • 10
0

Assuming you are focusing on a single column in your data frame, saved into s. This shortcode will generate instances of the column with 7 lags.

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5], name='test')
shiftdf=pd.DataFrame()
for i in range(3):
    shiftdf = pd.concat([shiftdf , s.shift(i).rename(s.name+'_'+str(i))], axis=1)

shiftdf

>>
test_0  test_1  test_2
1   5   NaN NaN
2   4   5.0 NaN
3   3   4.0 5.0
4   2   3.0 4.0
5   1   2.0 3.0
c.Parsi
  • 771
  • 2
  • 9
  • 26
0

Based on the proposal by @charlie-brummitt, here is a revision that fix a set of columns:

def shift_timeseries_by_lags(df, fix_columns, lag_numbers, lag_label='lag'):
    df_fix = df[fix_columns]
    df_lag = df.drop(columns=fix_columns)

    df_lagged = pd.concat({f'{lag_label}_{lag}':
        df_lag.shift(lag) for lag in lag_numbers},
        axis=1)
    df_lagged.columns = ['__'.join(reversed(x)) for x in df_lagged.columns.to_flat_index()]

    return pd.concat([df_fix, df_lagged], axis=1)

Here is an example of usage:

df = shift_timeseries_by_lags(df_province_cases, fix_columns=['country', 'state'], lag_numbers=[1,2,3])

I personally prefer the lag name as suffix. But can be changed removing reversed().

juanbretti
  • 418
  • 8
  • 9