I have a Pandas DataFrame:

import pandas as pd

df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['B', '2014-01-01', '2014-01-07', 2.5],
                   ['C', '2014-01-01', '2014-01-07', 3.],
                   ['A', '2014-01-08', '2014-01-14', 13.],
                   ['B', '2014-01-08', '2014-01-14', 2.],
                   ['C', '2014-01-08', '2014-01-14', 1.],
                   ['A', '2014-01-15', '2014-01-21', 10.],
                   ['A', '2014-01-21', '2014-01-27', 98.],
                   ['B', '2014-01-21', '2014-01-27', -5.],
                   ['C', '2014-01-21', '2014-01-27', -72.],
                   ['A', '2014-01-22', '2014-01-28', 8.],
                   ['B', '2014-01-22', '2014-01-28', 25.],
                   ['C', '2014-01-22', '2014-01-28', -23.],
                   ['A', '2014-01-22', '2014-02-22', 8.],
                   ['B', '2014-01-22', '2014-02-22', 25.],
                   ['C', '2014-01-22', '2014-02-22', -23.],
                  ], columns=['Group', 'Start Date', 'End Date', 'Value'])

And the output looks like this:

   Group  Start Date    End Date  Value
0      A  2014-01-01  2014-01-07    1.2
1      B  2014-01-01  2014-01-07    2.5
2      C  2014-01-01  2014-01-07    3.0
3      A  2014-01-08  2014-01-14   13.0
4      B  2014-01-08  2014-01-14    2.0
5      C  2014-01-08  2014-01-14    1.0
6      A  2014-01-15  2014-01-21   10.0
7      A  2014-01-21  2014-01-27   98.0
8      B  2014-01-21  2014-01-27   -5.0
9      C  2014-01-21  2014-01-27  -72.0
10     A  2014-01-22  2014-01-28    8.0
11     B  2014-01-22  2014-01-28   25.0
12     C  2014-01-22  2014-01-28  -23.0
13     A  2014-01-22  2014-02-22    8.0
14     B  2014-01-22  2014-02-22   25.0
15     C  2014-01-22  2014-02-22  -23.0

I am trying to add a new column with data from the same group in the previous period (if it exists). So, the output should look like this:

   Group  Start Date    End Date  Value   Last Period Value
0      A  2014-01-01  2014-01-07    1.2                 NaN
1      B  2014-01-01  2014-01-07    2.5                 NaN
2      C  2014-01-01  2014-01-07    3.0                 NaN
3      A  2014-01-08  2014-01-14   13.0                 1.2
4      B  2014-01-08  2014-01-14    2.0                 2.5   
5      C  2014-01-08  2014-01-14    1.0                 3.0
6      A  2014-01-15  2014-01-21   10.0                13.0 
7      A  2014-01-21  2014-01-27   98.0                 NaN
8      B  2014-01-21  2014-01-27   -5.0                 NaN
9      C  2014-01-21  2014-01-27  -72.0                 NaN
10     A  2014-01-22  2014-01-28    8.0                10.0     
11     B  2014-01-22  2014-01-28   25.0                 NaN
12     C  2014-01-22  2014-01-28  -23.0                 NaN
13     A  2014-01-22  2014-02-22    8.0                 NaN   
14     B  2014-01-22  2014-02-22   25.0                 NaN   
15     C  2014-01-22  2014-02-22  -23.0                 NaN   

Notice that the rows with NaN have no corresponding row with the same group in the previous period. So, a row that spans 7 days (one week) needs to be matched with the row for the same group from the previous week.
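
To make the matching rule concrete: the period length is the inclusive day count, which can be checked like this (a quick sketch, assuming the date columns are first parsed with pd.to_datetime):

import pandas as pd

# Parse the string dates so date arithmetic works.
start = pd.to_datetime(df['Start Date'])
end = pd.to_datetime(df['End Date'])

# Inclusive number of days in each period (7 for the weekly rows above).
print((end - start).dt.days + 1)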

slaw
  • How is the "previous period" defined? Are periods equivalent to calendar weeks or can there be arbitrary periods? If they are always equivalent to one week, it might be helpful to convert the period start date to a week number. https://stackoverflow.com/questions/31181295/converting-a-pandas-date-to-week-number – malte Feb 08 '16 at 21:11
  • The period can be variable (defined by the number of days). So, row index #3 is 7 days and the last 7 day period (with the same group) directly before it is row index #0. So, the group has to be the same, the number of days has to be the same, and the two periods must be continuous (the start date of the current period is one day after the end date of the last period). – slaw Feb 08 '16 at 21:18
  • In using the week number, are the week numbers ever increasing or do they start at 1 on January 1st? Again, the length of each period can be variable so I'm not sure this would work. – slaw Feb 08 '16 at 21:20

3 Answers


Suppose we compute the duration between Start and End for each row:

df['duration'] = df['End']-df['Start']

and suppose we also compute the previous Start value based on that duration:

df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)

Then we can express the desired DataFrame as the result of a merge between df and itself where we merge rows whose Group, duration and Prev (in one DataFrame) match the Group, duration and Start (in the other DataFrame):

import pandas as pd

df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['B', '2014-01-01', '2014-01-07', 2.5],
                   ['C', '2014-01-01', '2014-01-07', 3.],
                   ['A', '2014-01-08', '2014-01-14', 13.],
                   ['B', '2014-01-08', '2014-01-14', 2.],
                   ['C', '2014-01-08', '2014-01-14', 1.],
                   ['A', '2014-01-15', '2014-01-21', 10.],
                   ['A', '2014-01-21', '2014-01-27', 98.],
                   ['B', '2014-01-21', '2014-01-27', -5.],
                   ['C', '2014-01-21', '2014-01-27', -72.],
                   ['A', '2014-01-22', '2014-01-28', 8.],
                   ['B', '2014-01-22', '2014-01-28', 25.],
                   ['C', '2014-01-22', '2014-01-28', -23.],
                   ['A', '2014-01-22', '2014-02-22', 8.],
                   ['B', '2014-01-22', '2014-02-22', 25.],
                   ['C', '2014-01-22', '2014-02-22', -23.],
                  ], columns=['Group', 'Start', 'End', 'Value'])
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])

df['duration'] = df['End']-df['Start']
df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)

result = pd.merge(df, df[['Group','duration','Start','Value']], how='left',
                  left_on=['Group','duration','Prev'], 
                  right_on=['Group','duration','Start'], suffixes=['', '_y'])
result = result[['Group', 'Start', 'End', 'Value', 'Value_y']]
result = result.rename(columns={'Value_y':'Prev Value'})
print(result)

yields

   Group      Start        End  Value  Prev Value
0      A 2014-01-01 2014-01-07    1.2         NaN
1      B 2014-01-01 2014-01-07    2.5         NaN
2      C 2014-01-01 2014-01-07    3.0         NaN
3      A 2014-01-08 2014-01-14   13.0         1.2
4      B 2014-01-08 2014-01-14    2.0         2.5
5      C 2014-01-08 2014-01-14    1.0         3.0
6      A 2014-01-15 2014-01-21   10.0        13.0
7      A 2014-01-21 2014-01-27   98.0         NaN
8      B 2014-01-21 2014-01-27   -5.0         NaN
9      C 2014-01-21 2014-01-27  -72.0         NaN
10     A 2014-01-22 2014-01-28    8.0        10.0
11     B 2014-01-22 2014-01-28   25.0         NaN
12     C 2014-01-22 2014-01-28  -23.0         NaN
13     A 2014-01-22 2014-02-22    8.0         NaN
14     B 2014-01-22 2014-02-22   25.0         NaN
15     C 2014-01-22 2014-02-22  -23.0         NaN

In the comments, Artur Nowak asks about the time complexity of pd.merge. I believe it performs an O(N + M) hash join, where N is the size of the hashed table and M is the size of the lookup table. Here is some code to empirically test the performance of pd.merge as a function of DataFrame size.

import collections
import string
import timeit 
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

timing = collections.defaultdict(list)

def make_df(ngroups, ndur, ndates):
    groups = list(string.ascii_uppercase[:ngroups])
    durations = range(ndur)
    start = pd.date_range('2000-1-1', periods=ndates, freq='D')

    index = pd.MultiIndex.from_product([start, durations, groups], 
                                       names=['Start', 'duration', 'Group'])
    values = np.arange(len(index))
    df = pd.DataFrame({'Value': values}, index=index).reset_index()
    df['End'] = df['Start'] + pd.to_timedelta(df['duration'], unit='D')
    df = df.drop('duration', axis=1)
    df = df[['Group', 'Start', 'End', 'Value']]

    df['duration'] = df['End']-df['Start']
    df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)
    return df

def using_merge(df):
    result = pd.merge(df, df[['Group','duration','Start','Value']], how='left',
                      left_on=['Group','duration','Prev'], 
                      right_on=['Group','duration','Start'], suffixes=['', '_y'])
    return result

Ns = np.array([10**i for i in range(5)])
for n in Ns:
    timing['merge'].append(timeit.timeit(
        'using_merge(df)',
        'from __main__ import using_merge, make_df; df = make_df(10, 10, {})'.format(n),
        number=5))

print(timing['merge'])
slope, intercept, rval, pval, stderr = stats.linregress(Ns, timing['merge'])
print(slope, intercept, rval, pval, stderr)

plt.plot(Ns, timing['merge'], label='merge')
plt.plot(Ns, slope*Ns + intercept)
plt.legend(loc='best')
plt.show()

This suggests that, up to around a million rows (10 groups × 10 durations × 10⁴ start dates), pd.merge's running time grows roughly linearly with the size of the DataFrame.

(Plot: pd.merge timing versus DataFrame size, with the fitted line overlaid.)
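
For comparison, here is a minimal sketch of the split/sort/scan alternative raised in the comments below, vectorized with groupby and shift; it reuses the duration and Prev helper columns defined above:

# Order periods chronologically within each (Group, duration) bucket.
df = df.sort_values(['Group', 'duration', 'Start'])
g = df.groupby(['Group', 'duration'])

# The preceding row in a bucket is the candidate previous period; accept
# its Value only when its Start equals this row's expected previous Start.
prev_value = g['Value'].shift(1)
prev_start = g['Start'].shift(1)
df['Prev Value'] = prev_value.where(prev_start == df['Prev'])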

unutbu
  • Just out of curiosity, do you have any knowledge of the computational complexity of the `merge` operation? I was unable to find this information, and I wonder how it compares with splitting the data per group and duration (defined as in your answer), sorting by period start, and then going through the data sequentially. – Artur Nowak Feb 11 '16 at 12:00
  • @ArturNowak: I believe `pd.merge` performs a [hash join which is `O(N+M)`](http://stackoverflow.com/a/2065782/190597). As a practical matter, I think it is always necessary to benchmark both versions on data approximating the actual use-case to determine which is faster (for that use-case). I've added some `timeit` code to investigate the performance of `pd.merge` as a function of the size of the DataFrame. If you would add code to do the splitting/sorting/sequential processing, we could do some empirical testing. – unutbu Feb 11 '16 at 16:03

The simplest method (although with quadratic complexity) would be as follows:

import datetime as dt
df['sd'] = pd.to_datetime(df['Start Date'])
df['ed'] = pd.to_datetime(df['End Date'])

def find_previous_period(row):
  prev_sd = row.sd - dt.timedelta(days=7)
  prev_ed = row.ed - dt.timedelta(days=7)
  prev_period = df[(df.sd == prev_sd) & (df.ed == prev_ed) & (df.Group == row.Group)]
  if not prev_period.empty:
    return prev_period['Value'].iloc[0]

df['Last Period Value'] = df.apply(find_previous_period, axis=1)

A more elegant solution may be required if you have a lot of data.


Update for the requirement that the number of days need to be the same (from the comments):

def find_previous_period(row):
  delta = row.ed - row.sd + dt.timedelta(days=1)
  prev_sd = row.sd - delta
  prev_ed = row.ed - delta
  prev_period = df[(df.sd == prev_sd) & (df.ed == prev_ed) & (df.Group == row.Group)]
  if not prev_period.empty:
    return prev_period['Value'].iloc[0]
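
As before, the new column comes from applying the function row-wise:

df['Last Period Value'] = df.apply(find_previous_period, axis=1)
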
Artur Nowak
  • Indeed, I have a lot of data and so I was hoping to find a more elegant and faster solution than n-squared performance. – slaw Feb 08 '16 at 22:04

If I'm understanding your definition of "period" right, this will work and should be pretty fast.

import datetime as dt

df['sd'] = pd.to_datetime(df['Start Date'])
df['ed'] = pd.to_datetime(df['End Date'])

# Offset copies of the dates by one day for the self-merge.
df['sd2'] = df.sd - dt.timedelta(days=1)
df['ed2'] = df.ed - dt.timedelta(days=1)

df2 = pd.merge(df, df[['sd2', 'ed2', 'Value', 'Group']],
               left_on=['sd', 'Group', 'ed'],
               right_on=['sd2', 'Group', 'ed2'], how='outer', copy=False)

You'll have to clean up the column names / delete the extra columns.
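
For example, one possible cleanup (a sketch; the _x/_y names assume pandas' default merge suffixes, which depend on how the key columns overlap):

# Keep the previous period's value under a descriptive name.
df2 = df2.rename(columns={'Value_x': 'Value', 'Value_y': 'Last Period Value'})

# Drop the helper date columns, including their suffixed copies.
df2 = df2.drop(columns=[c for c in df2.columns if c.startswith(('sd', 'ed'))])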

  • This is close. The number of days within the two periods (current and last) have to also be the same. This is in addition to the two periods being continuous as you have. – slaw Feb 09 '16 at 01:03
  • The timedelta days should be equal to (end - start + 1) for each row. Otherwise, the periods would be overlapping rather than consecutive and continuous (i.e., neighboring weeks). – slaw Feb 09 '16 at 06:32
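
Following slaw's comments, a sketch of the per-row offset that would replace the fixed one-day shift above (delta is the inclusive period length, and the sign flips so each row's shifted copy lines up as the previous period of the row it matches):

# Inclusive length of each period, which can vary from row to row.
delta = df['ed'] - df['sd'] + dt.timedelta(days=1)

# A same-length successor period starts and ends exactly one period
# length later, so shift forward by delta before the self-merge.
df['sd2'] = df['sd'] + delta
df['ed2'] = df['ed'] + delta

With these columns, the same merge as above pairs each row only with the same-Group, same-length period that ends the day before it starts.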