0

I have a script which generates dataframes containing 'blocks' of dates and stores them in a variable 'cur_blocks' such as this one:

                  year      start        end       date
gmt_reportedtime                                       
2019-03-11        2000 2000-01-09 2000-03-10 2019-03-11
2019-03-11        2001 2001-01-09 2001-03-10 2019-03-11
2019-03-11        2002 2002-01-09 2002-03-10 2019-03-11
2019-03-11        2003 2003-01-09 2003-03-10 2019-03-11
2019-03-11        2004 2004-01-09 2004-03-10 2019-03-11

I also have another dataframe 'col' indexed on datetime which has data in it that I would like to do downstream calculations such as ranking and z-scores on. I would like to slice multiple chunks of col using the blocks denoted in 'start' and 'end' of 'cur_blocks'. The date blocks in 'cur_blocks' are 60 day windows each separated by 1 year between 2000-2019.

Here is 'col':

              X
2000-01-01   24
2000-01-02   90
2000-01-03   62
2000-01-04   10
...
2019-02-28   73
2019-03-01   96

[7000 rows x 1 columns]

So far I have solved it like this:

window_aggregate = pd.DataFrame()
for index, block in cur_blocks.iterrows():
    dt_block = col.loc[block['start']:block['end']]
    window_aggregate = pd.concat([window_aggregate, dt_block])

I can then use 'window_aggregate' to do downstream work. This works, however it is very slow because 'cur_blocks' is generated for every single date between 2000-01-01 and today.

I am surprised that there isn't a built in function in pandas to do this kind of thing? Unless I missed it? But another user on stack overflow has also posed a similar question which so far hasn't been answered here

I recently posted a question detailing the broader problem here, although the code I've written mostly solves the problem it is far too slow for practical purposes.

I've seen it solved for a single day over several years here but not with a window.

There may be a solution here but I don't understand how to apply the answer to my data

EDIT: showing desired output something like below with the dates in the dataframe corresponding to the start and end blocks in cur_blocks

Desired output

              X
2000-01-09   24
2000-01-10   90
...
2000-03-10   62
2001-01-09   10
2001-01-10   10
...
2001-03-10   10
2002-01-09   10
...
...
...
2004-03-10   73
user3062260
  • 1,584
  • 4
  • 25
  • 53
  • Why would `cur_blocks` be generated for every single date? It should just go row-by row and apply the windows you created. Also would be helpful to show your desired output. – elPastor Mar 13 '19 at 11:32
  • I was getting an error creating the `for` loop you had, but was working fine if you replace that line and the following with: `for index, block in cur_blocks.iterrows(): df = col[(col.index > block['start']) & (col.index < block['end'])]` – elPastor Mar 13 '19 at 11:44
  • just noticed the loop error elPastor and now updated. – user3062260 Mar 13 '19 at 11:46
  • So the start and end cols in cur_clocks have different values for each date - when the date column is 1 day later all the values in the start and end cols will also be one day later - I can't simply use the same block of dates for each day – user3062260 Mar 13 '19 at 11:48
  • Part of the problem with helping you is that your use of "dates" and "days" is confusing and can seemingly be interchanged. Would be best to reference the actual column names. – elPastor Mar 13 '19 at 12:08
  • I appreciate you taking a look for me, apologies for the confusion: yes you are right - a given date should appear only a single time so I have somewhat carelessly used the term date and day interchangeably. By 'day' I mean a single date. Please let me know if you need me to clarify anything else – user3062260 Mar 13 '19 at 12:18

1 Answers1

0

I would merge both dataframes on year. Then you could simply filter the resulting dataframes with rows where the date from col is between the start and end date. After the various index manipulations required for the merge, and a final column selection, it gives:

pd.DataFrame(pd.concat([col, pd.Series(col.index.year,
                       index=col.index, name='year')],axis=1)
         .rename_axis('dat').reset_index().merge(cur_blocks, on='year')
         .query('(start <= dat) & (dat <= end)').set_index('dat')
         .rename_axis('')['X'])
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252