1

OK, so this is a tricky question to describe.

I am putting together a compliance report for a study, whereby the team needs to know if each user performed a particular task on 7 out of 8 days at least once per month.

So I need a way to:

  1. For a given user and a given month, search the month for an 8 day period where at least 7 out of those 8 days were marked "TRUE"
  2. Return a 1 is requirement was met, 0 if requirement was not met.

Here is an example of the data structure:

import pandas as pd

ids = 1
req_met = ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE']
date = ['2018-01-01', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11',
'2018-01-12',  '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-21', '2018-01-23']

df = pd.DataFrame({'id':ids,
                   'Date':date,
                   'req_met':req_met})
print(df)

    id        Date req_met
0    1  2018-01-01    TRUE
1    1  2018-01-03    TRUE
2    1  2018-01-04   FALSE
3    1  2018-01-05   FALSE
4    1  2018-01-06    TRUE
5    1  2018-01-07    TRUE
6    1  2018-01-08    TRUE
7    1  2018-01-09   FALSE
8    1  2018-01-10    TRUE
9    1  2018-01-11    TRUE
10   1  2018-01-12    TRUE
11   1  2018-01-13    TRUE
12   1  2018-01-14   FALSE
13   1  2018-01-15   FALSE
14   1  2018-01-16   FALSE
15   1  2018-01-17   FALSE
16   1  2018-01-18    TRUE
17   1  2018-01-19    TRUE
18   1  2018-01-21    TRUE
19   1  2018-01-23   FALSE

For this user, the answer returned would be '1' because they do have an 8 day period whereby 7 of those days were 'TRUE' (2018-01-06 to 2018-01-13). You can see that the date range is not always consecutive which is an added complication.

My desired output would be the most efficient function that could take this data and return either a '1' (requirement met) or a '0' (requirement not met)

Thanks in advance for any assistance.

Dr Wampa
  • 443
  • 3
  • 8
  • and I assume you don't want to just loop through them one by one? – Mahrkeenerh Oct 06 '21 at 22:01
  • If possible not but if that's the best option and it's going to not add too much time then that should be OK. Thanks! – Dr Wampa Oct 06 '21 at 22:16
  • I mean, if it's a couple thousand of lines, it should be done in a split second. Don't konw how many entries you have. – Mahrkeenerh Oct 06 '21 at 22:18
  • You mentioned that you have some gaps in the dates, do the 8-day windows need to be contiguous? Or can it be any sequential 8-day window? – ddejohn Oct 06 '21 at 22:18
  • Yes the 8 days have to be sequential but only 7 of those 8 days have to be true. It can be any 8 day window within the calendar month. For the dates that are missing you can count them as FALSE since the user didn't log any activity that day. Hope that makes sense. Thanks! – Dr Wampa Oct 06 '21 at 22:27
  • @Mahrkeenerh yes I think loops should be fine since the data here is already split into month ranges. Thanks! – Dr Wampa Oct 06 '21 at 22:29

2 Answers2

2

First convert types to bool and datetime:

df['req_met'] = df['req_met'].replace({'TRUE':True, 'FALSE':False})
df['Date'] = pd.to_datetime(df.Date)

If there are missing dates and we assume user didn't log any activity that day, we need to insert missing days:

df = (df.set_index('Date')
        .groupby('id').req_met
        .resample('D').sum()
        .reset_index()
        )

df['Month'] = df.Date.dt.strftime("%Y-%m")

Notice that resample is used per id of user to ensure consecutive calendar days per user. Then using rolling method:

df_result = (df.groupby(['id','Month'])
             .rolling(8)['req_met'].sum().ge(7)
             .groupby(['id','Month'])
             .agg({('req_met','max')})
             .reset_index()
             )

the result is:

   id    Month  req_met
0   1  2018-01     True

Please note groupby is used twice. You can inspect calculations by running code step-by-step to fully understand the logic.

ipj
  • 3,488
  • 1
  • 14
  • 18
1

EDIT: my bad I misread your response to my question and thought you were confirming that the 8-day windows needed to be consecutive. Since that's not the case, @ipj's answer works great.

I'll leave this up for future readers wanting something similar but for consecutive windows.

Old answer, when I mistakenly thought that for any given 8-day window, the dates needed be consecutive

First, I'll concatenate a couple copies with different id values:

df1 = pd.DataFrame({'id':ids, 'Date': date, 'req_met': req_met})

df2 = df1.copy()
df2.id = 2

df3 = df1.copy()
df3.id = 3

df = pd.concat([df1, df2, df3]).reset_index(drop=True)
df.Date = pd.to_datetime(df.Date)

Now, create a mask for rows which aren't consecutive dates (modified from this answer):

>>> mask = (df.Date.diff(-1).dt.days == -1) | (df.Date.diff().dt.days == 1)

Now you can do a rolling sum:

>>> (df[mask].groupby("id").req_met.rolling(8).sum() >= 7).groupby("id").sum()
id
1    1
2    1
3    1
Name: req_met, dtype: int64

All three are 1 just because I just copied the original dataframe three times but this should work for whatever actual dataframe you have. If your data aren't already grouped by date, then you'll need to add that to the groupby.

ddejohn
  • 8,775
  • 3
  • 17
  • 30