
I have some code which produces a DataFrame output with columns date and colx (a given value). df =

   index      date                    colx
2018-08-09    NaN                     NaN  
2018-08-10    2018-08-10 00:00:00 -0.200460 
2018-08-13    NaN                     NaN  
2018-08-14    NaN                     NaN 
2018-08-15    NaN                     NaN 
2018-08-16    NaN                     NaN  
2018-08-17    NaN                     NaN  
2018-08-20    NaN                     NaN  
2018-08-21    NaN                     NaN 
2018-08-22    2018-08-22 00:00:00 -2.317475 
2018-08-23    2018-08-23 00:00:00 -1.652724 
2018-08-24    2018-08-24 00:00:00 -3.669870 
2018-08-27    2018-08-27 00:00:00 -3.807074 
2018-08-28    2018-08-28 00:00:00 -0.257006 
2018-08-29    NaN                     NaN  
2018-08-30    2018-08-30 00:00:00 -0.374825 
2018-08-31    2018-08-31 00:00:00 -5.655345 
2018-09-03    2018-09-03 00:00:00 -4.631105 
2018-09-04    2018-09-04 00:00:00 -4.722768 
2018-09-05    2018-09-05 00:00:00 -3.012673 
2018-09-06    NaN                     NaN 

The date column is the same as the index for selected rows, and np.nan elsewhere.

What I am looking to achieve, and am unsure how to do, is to extract the first date and last date of each block of data (without the 00:00:00).

With the help of the following link I am able to tackle the cumsum part, but not the extraction of the data into the required output below:

python pandas conditional cumulative sum

# running total of colx that restarts after every gap (rows where colx is 0)
b = df.colx
c = b.cumsum()
df['cumsumcolx'] = c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(float)
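The reset works because c.mask(b != 0) keeps the running total only at the gap rows (where colx is 0); forward-filling that and subtracting it restarts the sum after every gap. A tiny self-contained illustration of the same idea, with made-up values:

import pandas as pd

b = pd.Series([0, -1.0, -2.0, 0, -3.0])   # 0 marks a gap row
c = b.cumsum()
reset = c.sub(c.mask(b != 0).ffill(), fill_value=0)
print(reset.tolist())  # [0.0, -1.0, -3.0, 0.0, -3.0]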

Applied to my df, this code gives me:

   index      date                    colx     cumsumcolx
2018-08-09    0                        0           0    
2018-08-10    2018-08-10 00:00:00 -0.200460    -0.200460 
2018-08-13    0                        0           0  
2018-08-14    0                        0           0  
2018-08-15    0                        0           0  
2018-08-16    0                        0           0  
2018-08-17    0                        0           0  
2018-08-20    0                        0           0  
2018-08-21    0                        0           0  
2018-08-22    2018-08-22 00:00:00 -2.317475     -2.317475
2018-08-23    2018-08-23 00:00:00 -1.652724     -3.970198
2018-08-24    2018-08-24 00:00:00 -3.669870     -7.640069
2018-08-27    2018-08-27 00:00:00 -3.807074     -11.447143
2018-08-28    2018-08-28 00:00:00 -0.257006     -11.704148
2018-08-29    0                        0           0  
2018-08-30    2018-08-30 00:00:00 -0.374825     -0.374825
2018-08-31    2018-08-31 00:00:00 -5.655345     -6.030169
2018-09-03    2018-09-03 00:00:00 -4.631105     -10.661275
2018-09-04    2018-09-04 00:00:00 -4.722768     -15.384043
2018-09-05    2018-09-05 00:00:00 -3.012673     -18.396715
2018-09-06    0                        0           0  

Thus, I am asking for help with the extraction so that I achieve the expected output of a table/DataFrame:

entrydate     exitdate      cumsumcolx 
2018-08-10    2018-08-10    -0.200460
2018-08-22    2018-08-28    -11.704148
2018-08-30    2018-09-05    -18.396715

My df is very long, so I have just taken a snippet of it for illustration purposes.

Thank you

Junaid Mohammad
  • Why is `exitdate` in the last line of the example output `2018-08-31` and not `2018-09-05` ? – Chris Adams Sep 22 '18 at 09:42
  • I think the answer to this question will help you a lot: https://stackoverflow.com/questions/41420822/python-pandas-conditional-cumulative-sum – onno Sep 22 '18 at 09:44
  • @ChrisA you are right in pointing out my mistake, it should be '2018-09-05'. Thank you for spotting this, I will edit the post – Junaid Mohammad Sep 22 '18 at 09:45
  • @onno thank you I will investigate, I don't think it will help with the extraction of entrydate and exitdate columns required, however, I will try to keep working on the problem – Junaid Mohammad Sep 22 '18 at 09:56

2 Answers


First you need to label the separations between groups:

blanks = df.date.isnull()

Then label the groups themselves:

df['group'] = blanks.cumsum()

Now you have a column which labels each group, with one small defect: the first member of each group is a NaN row. Simply remove such rows:

df = df[~blanks]

Then use groupby:

grouped = df.groupby('group')
entrydate = grouped.date.first()
exitdate = grouped.date.last()
cumsumcolx = grouped.colx.sum()
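
For completeness, a minimal sketch of how these per-group pieces might be combined into the table from the question (assuming pandas is imported as pd; .dt.date drops the 00:00:00 part):

result = pd.DataFrame({
    'entrydate': pd.to_datetime(entrydate).dt.date,
    'exitdate': pd.to_datetime(exitdate).dt.date,
    'cumsumcolx': cumsumcolx,
}).reset_index(drop=True)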
John Zwinck
  • thank you, I follow the solution until we get to the grouping. I understand that you group, but entrydate and exitdate only give the entry and exit of the first group, not all of them, therefore I cant achieve the desired output. Is there a way to iterate over the groups to extract the relevant information? The solution to this point is very elegant, thank you – Junaid Mohammad Sep 22 '18 at 11:14

Another, similar solution is below:

# Python Code
def AggSum(dfg):
    # one summary row per block: first index value, last date, and the block's colx total
    return pd.DataFrame([[dfg.iloc[0].idx, dfg.iloc[-1].date, dfg.colx.sum()]],
                        columns=['entrydate', 'exitdate', 'cumsumcolx'])

# 'idx' is assumed to be the original index reset into a regular column
df['idx'] = pd.to_datetime(df['idx'])
df['date'] = pd.to_datetime(df['date'])
df['Group'] = df.colx.isnull().cumsum()   # label each block between the NaN gaps
df2 = df[df.colx.notnull()].groupby('Group', as_index=False).apply(AggSum)
df2.reset_index(drop=True, inplace=True)

#Output dataframe
   entrydate   exitdate  cumsumcolx
0 2018-08-10 2018-08-10   -0.200460
1 2018-08-22 2018-08-28  -11.704149
2 2018-08-30 2018-09-05  -18.396716
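
If a recent pandas (0.25 or later) is available, a similar result could likely be obtained without the custom apply by using named aggregation, reusing the Group, date and colx columns prepared above (a sketch, not tested against the original data):

df2 = (df[df.colx.notnull()]
       .groupby('Group', as_index=False)
       .agg(entrydate=('date', 'first'),
            exitdate=('date', 'last'),
            cumsumcolx=('colx', 'sum'))
       .drop(columns='Group'))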
nandneo