0

I have a dataframe that has four different columns and looks like the table below:

index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00 2 A 2,000 2 2019-11-01 10:00:00 3 A 5,000 3 2019-12-01 08:00:00 4 B 1,000 4 2020-01-01 05:00:00 5 B 6,000 5 2019-01-01 01:00:00 6 B 7,000 6 2019-04-01 11:00:00 7 A 8,000 7 2019-11-30 07:00:00 8 B 500 8 2020-01-01 05:00:00 9 B 1,000 9 2020-01-01 03:00:00 10 B 2,000 10 2020-01-01 02:00:00 11 A 1,000 11 2019-05-02 01:00:00

Purpose:

For each row, get the different rolling statistics for column_b based on a window of time in the datetime_column defined as the last N months. The window of time to look at however, is filtered by the value in column_a.

Code example using a for loop which is not feasible given the size:

mean_dict = {}
for index,value in enumerate(df.datetime_column)):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column<test_date)&\ 
                (df.datetime_column>=test_date-timedelta(days = 180))&
                 (df.column_a == test_column_a)]
    mean_dict[index] = df.column_b.mean()

For example for row #1:

  • Target date = 2020-01-01 11:00:00
  • Target value in column_a = A
  • Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
  • Average would be the mean of rows 2,3,7

If I wanted average for row #2 then it would be:

  • Target date = 2019-11-01 10:00:00
  • Target value in column_a = A
  • Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
  • Average would be the mean of rows 11

and so on...

I cannot use the grouper since in reality I do not have dates but datetimes.

Has anyone encountered this before?

Thanks!


EDIT The dataframe is big ~2M rows which means that looping is not an option. I already tried looping and creating a subset based on conditional values but it takes too long.

Max Payne
  • 387
  • 3
  • 17
  • 1
    I have difficulty in understanding your question. Could you read through carefully and see if you are putting out clearly what you are looking for? – SKPS Jan 28 '20 at 00:33
  • @SathishSanjeevi I have refactored the question, does it make more sense? – Max Payne Jan 28 '20 at 00:51
  • @JuanDaza could you just add expected `dataframe`? [JFYI](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Danila Ganchar Jan 29 '20 at 09:28

0 Answers0