
Consider the df:

import numpy as np
import pandas as pd

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df

I want to calculate the sum over a trailing 5 days, every 3 days.

I expect something that looks like this

[expected output image; reconstructed below from Zeugma's answer, which reproduces it with a centered window]

             A
2013-01-02  10.0
2013-01-05  25.0
2013-01-08  40.0

This was edited:
what I had was incorrect. @ivan_pozdeev and @boud noticed this was a centered window and that was not my intention. Apologies for the confusion.
Everyone's solutions capture much of what I was after.


Criteria

  • I'm looking for smart, efficient solutions that can be scaled to large data sets.

  • I'll be timing solutions and also considering elegance.

  • Solutions should also generalize to a variety of sample and look-back frequencies.


From comments

  • I want a solution that generalizes to handle a look-back of a specified frequency and grab anything that falls within that look-back (see the sketch after this list).
    • For the sample above, the look-back is 5D, and there may be 4 or 50 observations that fall within it.
  • I want the timestamp to be the last observed timestamp within the look-back period.
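
An editorial sketch of those semantics, assuming pandas >= 0.19's offset-based rolling windows; the exact bin alignment is left as a knob:

# Offset-based rolling handles an irregular DatetimeIndex: '5D' sums
# whatever observations fall in the trailing 5 calendar days.
trailing = df.rolling('5D').sum()

# Keep the last observed row in each 3-day bin, so each stamp is the
# last observed timestamp within that bin. In newer pandas (>= 1.1),
# bin alignment can be shifted with the Grouper's origin/offset arguments.
out = trailing.groupby(pd.Grouper(freq='3D')).tail(1)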
piRSquared
  • `df.rolling(window=5, min_periods=3).sum().dropna().resample('3D').last()`? Quite similar to the answer that is posted. – Nickil Maveli Oct 24 '16 at 07:09
  • are you taking the average date index as well, or would first or last index of the 5 day window make more sense? – Aaron Oct 28 '16 at 19:55
  • @Aaron last date in the look back window. – piRSquared Oct 28 '16 at 19:56
  • @piRSquared so basically what Steven G returned and not what your example output image is? – Aaron Oct 28 '16 at 19:57
  • @piRSquared also will the data frequency be consistent within a given set? (rolling window of fixed width?) – Aaron Oct 28 '16 at 20:07
  • @Aaron I want something generalized to handle looking back 5 days and grabbing anything that's there. – piRSquared Oct 28 '16 at 20:10
  • Would the dates always be such that there's exactly one entry per day and without any gap (no dates are missed between two consecutive entries)? Seeing your latest comment I am sensing that might not be maintained. – Divakar Oct 28 '16 at 20:11
  • @Divakar I'm expecting the `DatetimeIndex` could have irregular frequency. – piRSquared Oct 28 '16 at 20:12
  • [Off topic as a work request.](http://meta.stackoverflow.com/questions/274630/should-we-add-a-do-my-work-for-me-close-reason) [codegolf.se] maybe. – ivan_pozdeev Oct 30 '16 at 02:20
  • On second thought, the task is one-concern and generalized enough to have reuse value. – ivan_pozdeev Oct 30 '16 at 02:26
  • Considering the sample, let's say it had 12 elements instead of `11` i.e. : `pd.date_range('2012-12-31', periods=12, freq='D')`. What would be the last output? Would it be `'2013-01-11'` and `11` value at column `A` or `7+8+9+10+11` or we just omit that interval altogether from consideration? – Divakar Oct 30 '16 at 17:44

6 Answers


The df you gave us is:

             A
2012-12-31   0
2013-01-01   1
2013-01-02   2
2013-01-03   3
2013-01-04   4
2013-01-05   5
2013-01-06   6
2013-01-07   7
2013-01-08   8
2013-01-09   9
2013-01-10  10

You could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this; overall it should be relatively time-efficient.

df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]: 
                 A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
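
A small editorial variation (not from the original answer) stamps each sum with the last observation of its window, which is what the OP later said they wanted:

# Trailing 5-day sum, then every 3rd valid result, so each stamp is
# the window's last observed date.
df.rolling(5).sum().dropna().iloc[::3]
#                A
# 2013-01-04  10.0
# 2013-01-07  25.0
# 2013-01-10  40.0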
Steven G
  • This is a good answer, so plus one. I'll select it as the answer if I don't get a better one. I'm looking for two things: one, dates are aligned; two, the initial roll feels inefficient seeing as we only use every third observation. – piRSquared Oct 24 '16 at 04:50
  • @piRSquared I am not sure what you mean by dates being aligned; can you walk me through how you get `10.0` at `2013-01-02`? I understand that it feels like we are doing too much calculation by evaluating the 5-day sum **every day** – Steven G Oct 24 '16 at 12:09
  • @StevenG I believe `rolling().sum()` is smart enough to calculate in an incremental (and vectorized) manner, so I doubt there is much loss from doing it every day in this example. Although possibly if we were taking the sum every 20 days instead of every 3, this would not hold. – JohnE Oct 29 '16 at 18:09

Listed here are a few NumPy-based solutions using bin-based summing, covering three scenarios.

Scenario #1 : Multiple entries per date, but no missing dates

Approach #1 :

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)

    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr,vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #2 :

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()

    # Generate IDs at which shifts are to happen for a (2,3,5,8..) pattern
    # Pad with 0 and length of array at either end as we use diff later on
    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))

    # Find the start of those shifting indices    
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)    
    IDsums = np.bincount(id_arr,df.A.values)

    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]    

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #3 :

def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

We could replace the convolution part with direct sliced summation for a modified version of it -

def vectorized_app3_v2(df, S=3, W=5):  
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1,W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

Scenario #2 : Multiple entries per date and missing dates

Approach #4 :

def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
    return pd.DataFrame(out,index=out_index,columns=['A'])
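
As an editorial aside, the `//86400000000000` divisor converts nanosecond differences to whole days; the same day-offset index can be written more readably with `np.timedelta64`:

# Equivalent day-offset index built directly from timedelta64 division
dt = df.index.values                                  # datetime64[ns]
indx = ((dt - dt[0]) // np.timedelta64(1, 'D')).astype(int)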

Scenario #3 : Consecutive dates and exactly one entry per date

Approach #5 :

def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0]-W+2*S-1)//S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
                                        strides=(S*n,n)).sum(1)
    index_idx = (W-1)+S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out,index=out_index,columns=['A'])
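
As a side note (editorial, not part of the original answer), NumPy 1.20+ exposes the same strided-view idea via `sliding_window_view`, which avoids hand-written strides; this variant only emits full windows:

from numpy.lib.stride_tricks import sliding_window_view

def vectorized_app5_swv(df, S=3, W=5):
    vals = df.A.values
    out = sliding_window_view(vals, W)[::S].sum(1)   # strided trailing windows
    out_index = df.index[(W-1)::S][:len(out)]        # last date of each window
    return pd.DataFrame(out, index=out_index, columns=['A'])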

Suggestions for creating test-data

Scenario #1 :

# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])

Scenario #2 :

To create the setup for multiple dates with missing dates, we could just edit the `reps` creation step so that some dates get zero entries, like so -

reps = np.random.randint(0,4,(len(start_df)))

Scenario #3 :

# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
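
As an editorial sanity check (not part of the original answer), the Scenario #3 setup can be used to validate an approach against a plain pandas baseline:

# Compare vectorized_app5 against a simple pandas rolling baseline on
# the one-entry-per-date, no-gaps setup above.
expected = df.rolling(W).sum().dropna().iloc[::S]
result = vectorized_app5(df, S=S, W=W)
assert np.allclose(expected.A.values, result.A.values)
assert (expected.index == result.index).all()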
Divakar
  • @JohnE Well in the 12 case, there would be just 1 elem in the last interval and the expected behavior is not defined for such a case. Posted a comment to OP on the same. – Divakar Oct 30 '16 at 21:00
  • Thanks @Divakar! That was a fun experiment. I'm super busy at the moment; I'll eventually follow up with my own answer ;-) Your answers are always great. – piRSquared Nov 01 '16 at 21:46
  • Wooah @piRSquared! Loved the huge bounty, thanks! :) Nice answers all around to this interesting question. I am sure this Q&A would be helpful to future readers, so keep up the motivation work. – Divakar Nov 01 '16 at 21:51

If the dataframe is sorted by date, what we actually have is a single iteration over an array, calculating something as we go.

Here's an algorithm that calculates all the sums in a single iteration over the array. To understand it, see the scan of my notes below. This is the base, unoptimized version, intended to showcase the algorithm (optimized versions for Python and Cython follow); list(<call>) takes ~500 ms for an array of 100k on my system (P4). Since Python ints and ranges are relatively slow, this should benefit tremendously from being moved to C level.

from __future__ import division
import numpy as np

#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)

def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    @type data: ndarray
    @param n: number of trailing elements to sum up
    @param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1

    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M

    index=0
    currentsum=0

    while index<lim_index:
        for _ in range(Mp):
            #np.take is awkward, requiring a full list of indices to take
            for i in range(currentsum,currentsum+nsums-1):
                sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        currentsum=(currentsum+1)%nsums
  • Note that it produces the first sum at the kth element, not the nth (this can be changed, but at the cost of elegance, with a number of dummy iterations before the main loop; it is more elegantly done by prepending the data with extra zeros and discarding a number of the first sums)
  • It can easily be generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]), where operation is a parameter and should be a mutating operation (like ndarray.__iadd__).
  • Parallelizing between any number of workers by splitting the data is just as easy (if n>k, chunks after the first one should be fed extra elements at the start)
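
For reference, a minimal usage sketch (editorial; data, n, and k as in the snippet above):

# Consume the generator; with n=5, k=3 the first sum is yielded once
# k elements have been processed, as noted above.
trailing_sums = list(calc_trailing_data_with_interval(data, n=5, k=3))
print(trailing_sums[:3])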

To deduce the algorithm, I wrote out a sample for a case where a decent number of sums are calculated simultaneously, in order to see the patterns:

[image: handwritten notes outlining the case n=11, k=3]


Optimized: pure Python

Caching range objects brings the time down to ~300ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving output to an np.empty as opposed to yield is nonexistent.

def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    @type data: ndarray
    @param n: number of trailing elements to sum up
    @param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1

    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M
    RM=range(M)     #cache for efficiency
    RMp=range(Mp)   #cache for efficiency

    index=0
    currentsum=0
    currentsum_ranges=[range(currentsum,currentsum+nsums-1)
            for currentsum in range(nsums)]     #cache for efficiency

    while index<lim_index:
        for _ in RMp:
            #np.take is unusable as it allocates another array rather than view
            for i in currentsum_ranges[currentsum]:
                sums[i%nsums]+=data[index]
            index+=1
        for _ in RM:
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        currentsum=(currentsum+1)%nsums

Optimized: Cython

Statically typing everything in Cython instantly speeds things up to 150ms. And (optionally) assuming np.int as dtype to be able to work with data at C level brings the time down to as little as ~11ms. At this point, saving to an np.empty does make a difference, saving an unbelievable ~6.5ms, totalling ~5.5ms.

import numpy as np
cimport numpy as np

def calc_trailing_data_with_interval(np.ndarray data,int n,int k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    @type data: 1-d ndarray
    @param n: number of trailing elements to sum up
    @param k: interval with which to calculate sums
    """
    if not data.ndim==1: raise TypeError("One-dimensional array required")
    cdef int lim_index=data.size-k+1

    cdef np.ndarray result = np.empty(data.size//k,dtype=data.dtype)
    cdef int rindex = 0

    cdef int nsums = int(np.ceil(float(n)/k))
    cdef np.ndarray sums = np.zeros(nsums,dtype=data.dtype)

    #optional speedup for dtype=np.int
    cdef bint use_int_buffer = data.dtype==np.int and data.flags.c_contiguous
    cdef int[:] cdata = data
    cdef int[:] csums = sums
    cdef int[:] cresult = result

    cdef int M=n%k
    cdef int Mp=k-M

    cdef int index=0
    cdef int currentsum=0

    cdef int _,i
    while index<lim_index:
        for _ in range(Mp):
            #np.take is unusable as it allocates another array rather than view
            for i in range(currentsum,currentsum+nsums-1):
                if use_int_buffer:  csums[i%nsums]+=cdata[index]    #optional speedup
                else:               sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            if use_int_buffer:
                for i in range(nsums): csums[i]+=cdata[index]   #optional speedup
            else:               sums+=data[index]
            index+=1

        if use_int_buffer:  cresult[rindex]=csums[currentsum]     #optional speedup
        else:               result[rindex]=sums[currentsum]
        currentsum=(currentsum+1)%nsums
        rindex+=1
    return result
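
A usage sketch (editorial; assumes the code above lives in a Cython module, hypothetically named trailing.pyx, compiled e.g. with cythonize):

import numpy as np
from trailing import calc_trailing_data_with_interval  # hypothetical module name

data = np.random.randint(100, size=100000)  # same benchmark setup as above
sums = calc_trailing_data_with_interval(data, 5, 3)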
ivan_pozdeev

For regularly-spaced dates only

Here are two methods: first a pandas way, and second a numpy function.

>>> n=5   # trailing periods for rolling sum
>>> k=3   # frequency of rolling sum calc

>>> df.rolling(n).sum()[-1::-k][::-1]

               A
2013-01-01   NaN
2013-01-04  10.0
2013-01-07  25.0
2013-01-10  40.0

And here's a numpy function (adapted from Jaime's numpy moving_average):

def rolling_sum(a, n=5, k=3):
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame( ret[n-1:][-1::-k][::-1], 
                         index=a[n-1:][-1::-k][::-1].index )

rolling_sum(df,n=6,k=4)   # default n=5, k=3

For irregularly-spaced dates (or regularly-spaced)

Simply precede with:

df.resample('D').sum().fillna(0)

For example, the above methods become:

df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]

and

rolling_sum( df.resample('D').sum().fillna(0) )

Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas, as this is a strength of pandas over almost anything else out there. But you can likely find a numpy (or numba, or cython) approach that trades off some simplicity for an increase in speed. Whether this is a good tradeoff depends on your data size and performance requirements, of course.

For the irregularly spaced dates, I tested on the following example data and it seemed to work correctly. This will produce a mix of missing, single, and multiple entries per date:

np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
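
An editorial check on that irregular sample (not part of the original answer): after regularizing to daily frequency, the pandas chain and the numpy function should agree.

# Regularize to daily frequency, then compare the two methods above.
daily = df.resample('D').sum().fillna(0)
a = daily.rolling(5).sum()[-1::-3][::-1].dropna()
b = rolling_sum(daily)
assert np.allclose(a.values.ravel(), b.values.ravel())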
JohnE
  • So, this assumes exactly one entry per date, right? If so, can it be modified to accept multiple entries per date? – Divakar Oct 30 '16 at 08:12
  • Well, the question stated: `"for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back."` and that's only possible with multiple entries, hence my approaches :) – Divakar Oct 30 '16 at 13:58
  • @Divakar -- Ahhh, I didn't understand what you were saying but now I do. Thanks, I added a correction for this above. – JohnE Oct 30 '16 at 15:17

This isn't quite perfect, but I've gotta go make fake blood for a Halloween party tonight... you should be able to see what I was getting at through the comments. One of the biggest speedups is finding the window edges with np.searchsorted; the index offsets are the fiddly part.

import pandas as pd
import numpy as np

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

sample_freq = 3 #days
sample_width = 5 #days

sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day

times = df.index.astype(np.int64)//10**9  #array of timestamps (unix time)
cumsum = np.cumsum(df.A.values)  #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars

def yieldstep(mat, freq):
    normtime = ((mat[0] - mat[0,0]) / freq).astype(int) #integer numbers indicating sample number
    for i in range(max(normtime)+1):
        yield np.searchsorted(normtime, i) #yield beginning of window index

def sumwindow(mat, i, width): #i is the start of the window returned by yieldstep
    normtime = ((mat[0,i:] - mat[0,i])/ width).astype(int) #same as before, but we norm to window width
    j = i + np.searchsorted(normtime, 1) - 1 #right edge of window: last index still in the first width-sized bin
    #return rightmost timestamp of window in seconds from unix epoch and sum of window
    return mat[0,j], mat[1,j] - mat[1,i] + df.A.values[i] #cumsum difference excludes element i, so add it back

windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])
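
An editorial follow-up (not from the original answer): the (timestamp, sum) pairs can be turned back into a DataFrame.

# Rebuild a DataFrame from the (unix-seconds timestamp, window sum) pairs
out = pd.DataFrame(windowed_sums[:, 1],
                   index=pd.to_datetime(windowed_sums[:, 0], unit='s'),
                   columns=['A'])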
Aaron

Looks like a rolling centered window where you pick up data every n days:

def rolleach(df, ndays, window):
    return df.rolling(window, center=True).sum()[ndays-1::ndays]

rolleach(df, 3, 5)
Out[95]: 
               A
2013-01-02  10.0
2013-01-05  25.0
2013-01-08  40.0
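
For comparison (an editorial sketch, not part of the original answer), the trailing variant the OP later clarified they wanted simply drops center=True and right-aligns the picks:

def rolleach_trailing(df, ndays, window):
    return df.rolling(window).sum()[window-1::ndays]

rolleach_trailing(df, 3, 5)
#                A
# 2013-01-04  10.0
# 2013-01-07  25.0
# 2013-01-10  40.0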
Zeugma
  • You're right! The OP says "trailing 5 days" but the result they provide is actually for a centered window! – ivan_pozdeev Oct 30 '16 at 05:16
  • I wonder how it does calculations. Does it compute every element in the result independently, iterating over corresponding chunks of data, or compute sums for every element in some linked manner then throw away skipped results? – ivan_pozdeev Oct 30 '16 at 05:27