Pandas Q-cut: Binning Data using an Expanding Window Approach

Question

This question is somewhat similar to a 2018 question I have found on an identical topic.

I am hoping that if I ask it in a simpler way, someone will be able to figure out a simple fix to the issue that I am currently facing:

I have a timeseries dataframe named "df", which is roughly structured as follows:

            V_1   V_2  V_3  V_4
1/1/2000    17    77   15   88
1/2/2000    85    78    6   59
1/3/2000    31    9    49   16
1/4/2000    81    55   28   33
1/5/2000    8     82   82   4
1/6/2000    89    87   57   62
1/7/2000    50    60   54   49
1/8/2000    65    84   29   26
1/9/2000    12    57   53   84
1/10/2000   6     27   70   56
1/11/2000   61    6    38   38
1/12/2000   22    8    82   58
1/13/2000   17    86   65   42
1/14/2000   9     27   42   86
1/15/2000   63    78   18   35
1/16/2000   73    13   51   61
1/17/2000   70    64   75   83

If I wanted to use all the columns to produce daily quantiles, I would follow this approach:

quantiles =  df.apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)

The output looks like this:

           V_1  V_2 V_3 V_4
2000-01-01  1   3   0   4
2000-01-02  4   3   0   3
2000-01-03  2   0   2   0
2000-01-04  4   1   0   0
2000-01-05  0   4   4   0
2000-01-06  4   4   3   3
2000-01-07  2   2   3   2
2000-01-08  3   4   1   0
2000-01-09  0   2   2   4
2000-01-10  0   1   4   2
2000-01-11  2   0   1   1
2000-01-12  1   0   4   2
2000-01-13  1   4   3   1
2000-01-14  0   1   1   4
2000-01-15  3   3   0   1
2000-01-16  4   0   2   3
2000-01-17  3   2   4   4

What I want to do:

I would like to produce quantiles of the data in "df" using observations that occurred before and at a specific point in time. I do not want to include observations that occurred after the specific point in time.

For instance:

To calculate the bins for the 2nd of January 2000, I would like to just use observations from the 1st and 2nd of January 2000; and, nothing after the dates;
To calculate the bins for the 3rd of January 2000, I would like to just use observations from the 1st, 2nd and 3rd of January 2000; and, nothing after the dates;
To calculate the bins for the 4th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd and 4th of January 2000; and, nothing after the dates;
To calculate the bins for the 5th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd, 4th and 5th of January 2000; and, nothing after the dates;

Otherwise put, I would like to use this approach to calculate the bins for ALL the datapoints in "df". That is, to calculate bins from the 1st of January 2000 to the 17th of January 2000.

In short, what I want to do is to conduct an expanding window q-cut (if there is any such thing). It helps to avoid "look-ahead" bias when dealing with timeseries data.

This code block below is wrong, but it illustrates exactly what I am trying to accomplish:

quantiles =  df.expanding().apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)

Does anyone have any ideas of how to do this in a simpler fashion than this

Please provide a sample of what you want your output to look like. An explanation of what you want is one thing, but a visual is worth a thousand words. — Ukrainian-serge, Mar 11 '20 at 16:02
I have provided an example of the output, i.e. if the binning is done on a daily basis. I am not clear about how I would do it if I wanted to use the expanding window approach that I explained. — john_mon, Mar 11 '20 at 17:32
I can reproduce you current output. What is essential for a good question is a sample example of what you WANT the output to look like. — Ukrainian-serge, Mar 11 '20 at 17:49
Does this answer your question? [How to access pandas DataFrame datetime index using strings](https://stackoverflow.com/questions/36871188/how-to-access-pandas-dataframe-datetime-index-using-strings) — Ukrainian-serge, Mar 11 '20 at 18:05

Ukrainian-serge · Answer 1 · 2020-03-11T18:07:04.497

0

I am new so take this with a grain of salt, but when broken down I believe your question is a duplicate because it requires simple datetime index slicing answered HERE.

lt_jan_5 = df.loc[:'2000-01-05'].apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)

print(lt_jan_5)

            V_1  V_2  V_3  V_4
2000-01-01    1    2    1    4
2000-01-02    4    3    0    3
2000-01-03    2    0    3    1
2000-01-04    3    1    2    2
2000-01-05    0    4    4    0

Hope this is helpful

edited Mar 11 '20 at 18:07

answered Mar 11 '20 at 17:46

Ukrainian-serge

854
7
12

Thank you for the answer. I had asked my question in a vague way, and; have now improved it to help you to understand what I seek to achieve. – john_mon Mar 12 '20 at 05:48

Pandas Q-cut: Binning Data using an Expanding Window Approach

1 Answers1