
I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.

I have two lists containing the start and end dates of 30-day windows:

start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]

I have a dataframe with a field called 'transaction_dt'.

What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row whenever the transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in place if possible.

EDIT:

As requested, here is some sample data in my format:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25

2 Answers


IIUC, you can build an IntervalIndex from the start/end arrays and use it to look up each transaction date:

# index df2 by its [Start, End] intervals, then look up each transaction date in that interval index
df2.index = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
df[['End', 'Start']] = df2.loc[df['transaction_dt']].values


df
Out[457]: 
  transaction_dt        End      Start
0     2017-01-02 2017-01-31 2017-01-01
1     2017-03-02 2017-03-31 2017-03-01
2     2017-04-02 2017-04-30 2017-04-01
3     2017-05-02 2017-05-31 2017-05-01

Data Input:

df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
  • I am getting a KeyError related to missing values, and earlier in the error traceback, nested tuple slicing related to "cannot index with multidimensional key" – Pylander Oct 05 '17 at 17:13
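A possible workaround for that KeyError (a sketch on my part, not from the original answer): IntervalIndex.get_indexer returns -1 for dates that fall outside every window instead of raising, so unmatched rows can simply be skipped.

# a sketch, assuming the df/df2 frames from the data input above and
# non-overlapping windows; rows with no matching window are left as NaT
idx = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
pos = idx.get_indexer(df['transaction_dt'])   # -1 where no interval contains the date
mask = pos != -1
df['Start'] = pd.NaT
df['End'] = pd.NaT
df.loc[mask, ['Start', 'End']] = df2[['Start', 'End']].iloc[pos[mask]].values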

If you want the month start and end, we can use the approach from Extracting the first day of month of a datetime type column in pandas:

import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)

df

Returns

customer_id transaction_dt  product price   units   start   end
0   1   2004-01-02  thing1  25  47  2004-01-01  2004-01-31
1   1   2004-01-17  thing2  150 8   2004-01-01  2004-01-31
2   2   2004-01-29  thing2  150 25  2004-01-01  2004-01-31
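
For comparison, a sketch of an equivalent way to get the same calendar-month bounds via the Period accessors (an assumption on my part, not part of the original answer):

# a sketch: month start/end from Period objects instead of offsets
per = df['transaction_dt'].dt.to_period('M')
df['start'] = per.dt.start_time
df['end'] = per.dt.end_time.dt.floor('d')   # drop the end-of-day time component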

New approach, building explicit 30-day windows:

import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

# Get all timestamps that are necessary.
# This assumes transaction_dt is sorted;
# if not, replace iloc[0] with the min date and iloc[-1] with the max date.
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1]+datetime.timedelta(days=30))

# We store all ranges here
ranges = list(zip(timestamps,timestamps[1:]))

# Loop through all values and add to column start and end
for ind,value in enumerate(df["transaction_dt"]):
    for i,(start,end) in enumerate(ranges):   
        if (value >= start and value <= end):
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also drop all earlier ranges.
            # This relies on the dates being sorted (remove it if they are not),
            # but it speeds things up for large datasets.
            for _ in range(i):
                ranges.pop(0) 
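
If the Python loop becomes slow on large frames, a vectorized variant of the same idea is possible (a sketch, assuming the timestamps list built above): bin every transaction into its 30-day window with an IntervalIndex in one pass.

# a sketch: bin every transaction_dt into its 30-day window at once;
# closed='left' keeps the back-to-back windows from overlapping
windows = pd.IntervalIndex.from_breaks(timestamps, closed='left')
pos = windows.get_indexer(df['transaction_dt'])   # -1 for dates outside every window
mask = pos != -1
df['start'] = pd.NaT
df['end'] = pd.NaT
df.loc[mask, 'start'] = windows.left[pos[mask]]
df.loc[mask, 'end'] = windows.right[pos[mask]]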