
I'm trying to generate a cumulative-sum column over any given datetime range using a pandas pivot table, but it's unclear how to actually achieve this. I can create one for all dates, as shown below.

All datetimes:

                      sum
                    count  cum_sum
dt
2015-01-01 10:00:00    10       10
2015-01-01 12:00:00    20       30
2015-01-01 13:00:00    30       60
2015-01-02 10:00:00    10       70
2015-01-02 12:00:00    20       90
2015-01-02 13:00:00    30      120

Between two specified datetimes:

                      sum
                    count  cum_sum
dt
2015-01-01 12:00:00    20       30
2015-01-01 13:00:00    30       60
2015-01-02 10:00:00    10       70
2015-01-02 12:00:00    20       90

Is there a way to generate the table above but start the accumulation from the start date of the query (or to do this in the dataframe itself)?

My code:

import pandas as pd
import numpy as np
from datetime import datetime

data = [
    {'count': 10, 'dt': datetime.strptime("20150101 10:00", "%Y%m%d %H:%M")},
    {'count': 20, 'dt': datetime.strptime("20150101 12:00", "%Y%m%d %H:%M")},
    {'count': 30, 'dt': datetime.strptime("20150101 13:00", "%Y%m%d %H:%M")},
    {'count': 10, 'dt': datetime.strptime("20150102 10:00", "%Y%m%d %H:%M")},
    {'count': 20, 'dt': datetime.strptime("20150102 12:00", "%Y%m%d %H:%M")},
    {'count': 30, 'dt': datetime.strptime("20150102 13:00", "%Y%m%d %H:%M")}
]

df = pd.DataFrame(data)
df['cum_sum'] = df['count'].cumsum()
pivot = pd.pivot_table(df, index=['dt'], aggfunc=[np.sum])
print(pivot)
result = pivot.query('dt >= "{0}" and dt <= "{1}"'.format(
    datetime.strptime("20150101 11:00", "%Y%m%d %H:%M"),
    datetime.strptime("20150102 12:00", "%Y%m%d %H:%M")
))
print(result)

Edit: I want to create a cumulative column over a date range as above, but with a sub-criterion (location).

data = [
    {'loc': 'Japan', 'count': 10, 'dt': datetime.strptime("20150101 10:00", "%Y%m%d %H:%M")},
    {'loc': 'Japan', 'count': 20, 'dt': datetime.strptime("20150101 12:00", "%Y%m%d %H:%M")},
    {'loc': 'Japan', 'count': 30, 'dt': datetime.strptime("20150101 13:00", "%Y%m%d %H:%M")},
    {'loc': 'London', 'count': 10, 'dt': datetime.strptime("20150102 10:00", "%Y%m%d %H:%M")},
    {'loc': 'London', 'count': 20, 'dt': datetime.strptime("20150102 12:00", "%Y%m%d %H:%M")},
    {'loc': 'NewYork', 'count': 30, 'dt': datetime.strptime("20150102 13:00", "%Y%m%d %H:%M")}
]

So, for a particular datetime range, the output would be:

loc                        count  cum_sum
Japan
    2015-01-01 10:00:00       10       10
    2015-01-01 13:00:00       30       40
    2015-01-02 13:00:00       30       70
London
    2015-01-01 12:00:00       20       20
    2015-01-02 10:00:00       10       30
    2015-01-02 12:00:00       20       50
2 Answers


Below is a simple, if not very sophisticated, way of doing things:

df = pd.DataFrame(data)
df.set_index('dt', inplace=True)
df['cumsum'] = df['count']
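# Zero out the counts before the start time so they don't add to the running total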
df.loc[df.index < datetime.strptime("20150101 11:00", "%Y%m%d %H:%M"), 'cumsum'] = 0.0
df['cumsum'] = df['cumsum'].cumsum()
print(df)

Gives the following result:

                     count  cumsum
dt                                
2015-01-01 10:00:00     10       0
2015-01-01 12:00:00     20      20
2015-01-01 13:00:00     30      50
2015-01-02 10:00:00     10      60
2015-01-02 12:00:00     20      80
2015-01-02 13:00:00     30     110
IanS
  • 15,771
  • 9
  • 60
  • 84
  • thanks - although how could this work for a range, and is there a way to remove the dates that don't fall into the range? Since 0 could be a valid value, I don't want to filter on that alone. – user2040800 Sep 22 '15 at 08:44
  • If you're not interested in keeping the values outside of the range, then it's even simpler. Try something like `df = df[df.index >= datetime.strptime("20150101 11:00", "%Y%m%d %H:%M")]` and `df = df[df.index <= datetime.strptime("20150102 12:00", "%Y%m%d %H:%M")]`, then calculate the cumsum like you did. Unless I missed something? – IanS Sep 22 '15 at 09:30
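A minimal sketch of the filter-then-cumsum approach suggested in the comment above (one way to do it, not the only one); it reuses the question's data, and the `start`/`end` names are just illustrative:

import pandas as pd
from datetime import datetime

data = [
    {'count': 10, 'dt': datetime(2015, 1, 1, 10, 0)},
    {'count': 20, 'dt': datetime(2015, 1, 1, 12, 0)},
    {'count': 30, 'dt': datetime(2015, 1, 1, 13, 0)},
    {'count': 10, 'dt': datetime(2015, 1, 2, 10, 0)},
    {'count': 20, 'dt': datetime(2015, 1, 2, 12, 0)},
    {'count': 30, 'dt': datetime(2015, 1, 2, 13, 0)},
]

df = pd.DataFrame(data).set_index('dt')

start = datetime(2015, 1, 1, 11, 0)
end = datetime(2015, 1, 2, 12, 0)

# Drop the rows outside the range first, then accumulate, so the
# cumulative sum starts again at the beginning of the range
window = df[(df.index >= start) & (df.index <= end)].copy()
window['cum_sum'] = window['count'].cumsum()
print(window)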

You could redefine the DataFrame's index using the column of datetimes and slice it with .loc, like this:

df.index = df.dt
time1 = datetime.strptime("20150101 11:00", "%Y%m%d %H:%M")
time2 = datetime.strptime("20150102 12:00", "%Y%m%d %H:%M")
df.loc[time1:time2, 'count'].cumsum()

If you want to include all of the values from the first day, you can use the date() method of the time1 datetime object:

df.loc[time1.date():time2, 'count'].cumsum()

gives:

2015-01-01 10:00:00    10
2015-01-01 12:00:00    30
2015-01-01 13:00:00    60
2015-01-02 10:00:00    70
2015-01-02 12:00:00    90
Name: count, dtype: int64

To get the output you asked for, starting at time1, you can append [time1:]:

df.loc[time1.date():time2, 'count'].cumsum()[time1:]

giving:

2015-01-01 12:00:00    30
2015-01-01 13:00:00    60
2015-01-02 10:00:00    70
2015-01-02 12:00:00    90
Name: count, dtype: int64

Edit

In answer to your follow-up question, you could use groupby (taken from this answer):

df.index = df.dt
# Filter the times, keeping both loc and count, and move dt back to a column
df = df.loc[time1.date():time2, ['loc', 'count']].reset_index()
df.groupby(by=['loc', 'dt']).sum().groupby(level=[0]).cumsum()

Gives:

                            count
loc     dt                        
Japan   2015-01-01 10:00:00     10
        2015-01-01 12:00:00     30
        2015-01-01 13:00:00     60
London  2015-01-02 10:00:00     10
        2015-01-02 12:00:00     30
NewYork 2015-01-02 13:00:00     30
  • thanks, this is what I was looking for! I'm applying this to a DF where the dt index isn't necessarily unique anymore, i.e. as well as {'count': 0} there will be {'Location': 'Japan'}. If I wanted to count the accumulated value between two dates (as per above) by location (Location -> Date -> Accumulated Count), do you know how to achieve that? – user2040800 Sep 24 '15 at 02:31
  • Edited above to demonstrate this. – user2040800 Sep 24 '15 at 07:12
  • wish I had the rep to upvote this. Got a key error on 'loc'; it seems the slice filter removes this column (so I added it back in before the grouping: `temp=df['loc']` before the filter, then `df['loc']=temp` after). – user2040800 Sep 25 '15 at 00:49
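For completeness, a minimal end-to-end sketch that puts the thread together for the grouped case, keeping the loc column through the date slice so the groupby can see it; the window/start/end names are illustrative and not from the original posts:

import pandas as pd
from datetime import datetime

data = [
    {'loc': 'Japan', 'count': 10, 'dt': datetime(2015, 1, 1, 10, 0)},
    {'loc': 'Japan', 'count': 20, 'dt': datetime(2015, 1, 1, 12, 0)},
    {'loc': 'Japan', 'count': 30, 'dt': datetime(2015, 1, 1, 13, 0)},
    {'loc': 'London', 'count': 10, 'dt': datetime(2015, 1, 2, 10, 0)},
    {'loc': 'London', 'count': 20, 'dt': datetime(2015, 1, 2, 12, 0)},
    {'loc': 'NewYork', 'count': 30, 'dt': datetime(2015, 1, 2, 13, 0)},
]

df = pd.DataFrame(data).set_index('dt').sort_index()

start = datetime(2015, 1, 1, 11, 0)
end = datetime(2015, 1, 2, 12, 0)

# Slice the date range but keep both loc and count, then move dt back to a column
window = df.loc[start:end, ['loc', 'count']].reset_index()

# Sum per (loc, dt), then accumulate within each loc group
result = window.groupby(['loc', 'dt']).sum().groupby(level=0).cumsum()
print(result)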