
I have some code which produces a DataFrame output with columns date and colx (a given value). df =

   index      date                    colx
2018-08-09    NaN                     NaN  
2018-08-10    2018-08-10 00:00:00 -0.200460 
2018-08-13    NaN                     NaN  
2018-08-14    NaN                     NaN 
2018-08-15    NaN                     NaN 
2018-08-16    NaN                     NaN  
2018-08-17    NaN                     NaN  
2018-08-20    NaN                     NaN  
2018-08-21    NaN                     NaN 
2018-08-22    2018-08-22 00:00:00 -2.317475 
2018-08-23    2018-08-23 00:00:00 -1.652724 
2018-08-24    2018-08-24 00:00:00 -3.669870 
2018-08-27    2018-08-27 00:00:00 -3.807074 
2018-08-28    2018-08-28 00:00:00 -0.257006 
2018-08-29    NaN                     NaN  
2018-08-30    2018-08-30 00:00:00 -0.374825 
2018-08-31    2018-08-31 00:00:00 -5.655345 
2018-09-03    2018-09-03 00:00:00 -4.631105 
2018-09-04    2018-09-04 00:00:00 -4.722768 
2018-09-05    2018-09-05 00:00:00 -3.012673 
2018-09-06    NaN                     NaN 

The date column is the same as the index for selected rows, and np.nan elsewhere.

What I am looking to achieve, and am unsure how to do, is to extract the first date and last date of each block of data (without the 00:00:00).

With the help of the following link I am able to tackle the cumsum part, but not the extraction of the data into the required output below:

python pandas conditional cumulative sum

# running total of colx that restarts after every gap (rows where colx is 0)
b = df.colx
c = b.cumsum()
df['cumsumcolx'] = c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(float)
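The reset works because c.mask(b != 0) keeps the running total only at the gap rows (where colx is 0); forward-filling that and subtracting it restarts the sum after every gap. A tiny self-contained illustration of the same idea, with made-up values:

import pandas as pd

b = pd.Series([0, -1.0, -2.0, 0, -3.0])   # 0 marks a gap row
c = b.cumsum()
reset = c.sub(c.mask(b != 0).ffill(), fill_value=0)
print(reset.tolist())  # [0.0, -1.0, -3.0, 0.0, -3.0]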

Applied to my df, this code gives me:

   index      date                    colx     cumsumcolx
2018-08-09    0                        0           0    
2018-08-10    2018-08-10 00:00:00 -0.200460    -0.200460 
2018-08-13    0                        0           0  
2018-08-14    0                        0           0  
2018-08-15    0                        0           0  
2018-08-16    0                        0           0  
2018-08-17    0                        0           0  
2018-08-20    0                        0           0  
2018-08-21    0                        0           0  
2018-08-22    2018-08-22 00:00:00 -2.317475     -2.317475
2018-08-23    2018-08-23 00:00:00 -1.652724     -3.970198
2018-08-24    2018-08-24 00:00:00 -3.669870     -7.640069
2018-08-27    2018-08-27 00:00:00 -3.807074     -11.447143
2018-08-28    2018-08-28 00:00:00 -0.257006     -11.704148
2018-08-29    0                        0           0  
2018-08-30    2018-08-30 00:00:00 -0.374825     -0.374825
2018-08-31    2018-08-31 00:00:00 -5.655345     -6.030169
2018-09-03    2018-09-03 00:00:00 -4.631105     -10.661275
2018-09-04    2018-09-04 00:00:00 -4.722768     -15.384043
2018-09-05    2018-09-05 00:00:00 -3.012673     -18.396715
2018-09-06    0                        0           0  

Thus, I am asking for help with the extraction so that I achieve the expected output of a table/DataFrame:

entrydate     exitdate      cumsumcolx 
2018-08-10    2018-08-10    -0.200460
2018-08-22    2018-08-28    -11.704148
2018-08-30    2018-09-05    -18.396715

My df is very long, so I have just taken a snippet of it for illustration purposes.

Thank you

Junaid Mohammad
  • Why is `exitdate` in the last line of the example output `2018-08-31` and not `2018-09-05` ? – Chris Adams Sep 22 '18 at 09:42
  • I think the answer to this question will help you a lot: https://stackoverflow.com/questions/41420822/python-pandas-conditional-cumulative-sum – onno Sep 22 '18 at 09:44
  • @ChrisA you are right in pointing out my mistake, it should be '2018-09-05'. Thank you for spotting this, I will edit the post – Junaid Mohammad Sep 22 '18 at 09:45
  • @onno thank you I will investigate, I don't think it will help with the extraction of entrydate and exitdate columns required, however, I will try to keep working on the problem – Junaid Mohammad Sep 22 '18 at 09:56

2 Answers


First you need to label the separations between groups:

blanks = df.date.isnull()

Then label the groups themselves:

df['group'] = blanks.cumsum()

Now you have a column which labels each group, with one small defect: the first member of each group is a NaN row. Simply remove such rows:

df = df[~blanks]

Then use groupby:

grouped = df.groupby('group')
entrydate = grouped.date.first()
exitdate = grouped.date.last()
cumsumcolx = grouped.colx.sum()
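
For completeness, a minimal sketch of how these per-group pieces might be combined into the table from the question (assuming pandas is imported as pd; .dt.date drops the 00:00:00 part):

result = pd.DataFrame({
    'entrydate': pd.to_datetime(entrydate).dt.date,
    'exitdate': pd.to_datetime(exitdate).dt.date,
    'cumsumcolx': cumsumcolx,
}).reset_index(drop=True)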
John Zwinck
  • thank you, I follow the solution until we get to the grouping. I understand that you group, but entrydate and exitdate only give the entry and exit of the first group, not all of them, therefore I cant achieve the desired output. Is there a way to iterate over the groups to extract the relevant information? The solution to this point is very elegant, thank you – Junaid Mohammad Sep 22 '18 at 11:14

Another, similar solution is below:

# Python Code
def AggSum(dfg):
    # one summary row per block: first index value, last date, and the block's colx total
    return pd.DataFrame([[dfg.iloc[0].idx, dfg.iloc[-1].date, dfg.colx.sum()]],
                        columns=['entrydate', 'exitdate', 'cumsumcolx'])

# 'idx' is assumed to be the original index reset into a regular column
df['idx'] = pd.to_datetime(df['idx'])
df['date'] = pd.to_datetime(df['date'])
df['Group'] = df.colx.isnull().cumsum()   # label each block between the NaN gaps
df2 = df[df.colx.notnull()].groupby('Group', as_index=False).apply(AggSum)
df2.reset_index(drop=True, inplace=True)

#Output dataframe
   entrydate   exitdate  cumsumcolx
0 2018-08-10 2018-08-10   -0.200460
1 2018-08-22 2018-08-28  -11.704149
2 2018-08-30 2018-09-05  -18.396716
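
If a recent pandas (0.25 or later) is available, a similar result could likely be obtained without the custom apply by using named aggregation, reusing the Group, date and colx columns prepared above (a sketch, not tested against the original data):

df2 = (df[df.colx.notnull()]
       .groupby('Group', as_index=False)
       .agg(entrydate=('date', 'first'),
            exitdate=('date', 'last'),
            cumsumcolx=('colx', 'sum'))
       .drop(columns='Group'))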
nandneo