
I have a data frame where one of the columns is Date (e.g. 5/29/2007), followed by col A, col B, and a few other columns. The date together with the values of col A and col B uniquely identifies a row. I want to add columns to the data frame which are the historical values of other columns (not A or B). For example, take some row i with some value for col D. I want to add a column which holds the value of col D for the previous day: that is, the value of col D in the row j that has the same values for col A and col B as row i, but whose date is the date of row i minus one day. How can I do this efficiently?

Also I will want to do this for multiple columns, not just col D.

Here is an example of what I want. Below is my current data frame as CSV.

Date, col-A, col-B, col-C, col-D
5/29/2007, A, B, 0, 1
5/29/2007, AA, Bf, 7, 124
5/29/2007, AV, Bf, 1, 4
5/30/2007, A, B, 10, 1
5/30/2007, AA, Bf, 3, 1
5/30/2007, AV, Bf, 1, 8

This is the desired output I want my data frame to have:

Date, col-A, col-B, col-C, col-D, col-D-1
5/30/2007, A, B, 10, 1, 1
5/30/2007, AA, Bf, 3, 1, 124
5/30/2007, AV, Bf, 1, 8, 4
user3494047

2 Answers


I would create a shifted copy, then merge with the original.

# shifted copy: subtracting a -1 day interval moves each date forward one
# day, so after the merge yesterday's values line up with today's rows
shiftInterval = pd.Timedelta('-1 days')  # your X-day lag interval
dfShifted = df.copy()
dfShifted['Date'] = dfShifted['Date'] - shiftInterval
dfShifted.drop(['col-C'], axis=1, inplace=True)  # keep only the key columns and the columns to lag

# merge, keeping only observations where the -1 lag is present
df2 = pd.merge(df,
               dfShifted,
               on=['Date', 'col-A', 'col-B'],
               how='inner',  # use 'left' to keep observations without lags
               suffixes=['', '-1'])

df2
#        Date col-A col-B  col-C  col-D  col-D-1
#0 2007-05-30     A     B     10      1        1
#1 2007-05-30    AA    Bf      3      1      124
#2 2007-05-30    AV    Bf      1      8        4
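
Since the question also asks about lagging several columns, not just col-D: the same merge generalizes directly, because the `-1` suffix is applied to every overlapping non-key column. A minimal sketch, assuming you also want col-C lagged (the drop above is simply skipped):

# Lag col-C and col-D together: keep both in the shifted copy and the
# merge suffixes every overlapping non-key column with '-1'.
dfShifted = df.copy()
dfShifted['Date'] = dfShifted['Date'] - pd.Timedelta('-1 days')
df2 = pd.merge(df, dfShifted, on=['Date', 'col-A', 'col-B'],
               how='inner', suffixes=['', '-1'])
# df2 columns: Date, col-A, col-B, col-C, col-D, col-C-1, col-D-1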

The above answer assumes you have a reproducible data set created by the following:

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("""Date,col-A,col-B,col-C,col-D
5/29/2007,A,B,0,1
5/29/2007,AA,Bf,7,124
5/29/2007,AV,Bf,1,4
5/30/2007,A,B,10,1
5/30/2007,AA,Bf,3,1
5/30/2007,AV,Bf,1,8"""))

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

This is something of a manual hack, since pandas already has a shift function. I tried pandas.DataFrame.shift() but couldn't make it work with freq and this indexing; someone who knows how should answer that way. One possible route is sketched below.
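
For reference, here is a sketch of one way shift can be made to work with freq. It is an untested-in-anger sketch, not this answer's method: it assumes each (col-A, col-B) pair occurs at most once per date (as in the sample data), and the names `wide` and `lagged` are illustrative. The idea is to pivot first so shift(freq='D') only has to move a DatetimeIndex, which is exactly the case freq supports, then stack back for the join.

# Hypothetical sketch: pivot col-D to a Date x (col-A, col-B) table, shift
# the DatetimeIndex forward one day with freq, then stack back and join.
wide = df.set_index(['Date', 'col-A', 'col-B'])['col-D'].unstack(['col-A', 'col-B'])
lagged = wide.shift(freq='D')  # index 5/29 becomes 5/30; values unchanged
lagged = lagged.stack(['col-A', 'col-B']).rename('col-D-1')

df2 = (df.join(lagged, on=['Date', 'col-A', 'col-B'])
         .dropna(subset=['col-D-1']))  # drop rows with no previous-day value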

C8H10N4O2
  • Note that `col-D-1` gets coerced to float because there's no NA integer for [some stupid reason](https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value) in pandas – C8H10N4O2 Oct 20 '17 at 15:33
  • Is this solvable by somehow removing all rows with NA? – user3494047 Oct 20 '17 at 16:14
  • @user3494047 please see edit. If you don't care about keeping the rows with NA values, you can use `how='inner'` and everything comes out `int64` – C8H10N4O2 Oct 20 '17 at 17:02
  • Note that you're subtracting the shift in the shifted copy because that's the date you want it to join *to*. If it makes it easier to think of a positive lag and adding that lag, flip both signs. – C8H10N4O2 Oct 20 '17 at 17:14
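
A hedged aside on the float-coercion comment above: pandas 0.24 and later ship a nullable integer dtype, so if you do keep the NA rows with how='left' you can cast the lagged column back to integers afterwards:

# Nullable Int64 (capital I) stores integers alongside pd.NA, so a left
# merge no longer forces the lagged column to float:
df2['col-D-1'] = df2['col-D-1'].astype('Int64')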
import numpy as np

# Brute-force O(n^2) lookup: for each row, pull col-D from the row with the
# same col-A/col-B one day earlier; NaN when there is no previous day.
def prev_d(row):
    match = df.loc[(df['Date'] == df.loc[row, 'Date'] - pd.Timedelta('1 day')) &
                   (df['col-A'] == df.loc[row, 'col-A']) &
                   (df['col-B'] == df.loc[row, 'col-B']), 'col-D']
    return match.iloc[0] if len(match) else np.nan

df['col-D-1'] = [prev_d(row) for row in df.index]
Acccumulation