A general approach would be like this:
# 1. To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# 2. Standardize the dates by offseting them
df["ColB_std"] = df["ColB"] - pd.to_timedelta(range(df.shape[0]), 'day')
# 3. Counting each instance of ColA and standardized date
s = df.groupby(["ColA", "ColB_std"])["ColB_std"].count()
# 4. Getting elements from ColA that have at least 1 sequence of at least length 3
colA = s[ s >= 3 ].index.get_level_values(0).unique().values
# 5. Filtering the dataframe
df[ df["ColA"].isin(colA) ]
You want ColA
s with 3 consecutive dates. Or you could think of it like you want ColA
s where there's a sequence of date
, date + 1 day
and date + 2 days
. By sorting the dataframe by ColA and ColB (1), we know that in the case you want to check, date + 1 day
will always follow date
, and date + 2 days
will be the following that one.
With this, you can standardize the dates by removing n
days corresponding to their row. So, the sequence of date
, date + 1 day
and date + 2 days
, becomes date
, date
and date
(2).
Now that we have the date column standardized, we just need to count how many elements each pair ('ColA', 'ColB_std')
exist (3), get the elements from ColA
that have counts of 3 or more (4), and filter the dataframe (5).
However, this doesn't support duplicated pairs of ('ColA', 'ColB')
, for that you'd need to do this first:
df2 = df.drop_duplicates(["ColA", "ColB"])
Proceding to use this df2
in steps 1, 2, 3 and 4, and in the end filtering the real df
in step 5.
Previously, I answered that you also could do it like this:
# To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# Calculating the difference between dates inside each group
s = df.groupby("ColA")["ColB"].diff().dt.days
# Filtering the dataframe
df[ ((s == 1) & (s.shift(1) == 1)).groupby(df["ColA"]).transform("any") ]
The idea is that in s
, the difference is always between the previous date, and the current date. However, this doesn't make sure that there are 3 consecutive dates, just 2. By shifting the series by 1, you are make sure that the current different and the previous one are 1 [ (s == 1) & (s.shift(1) == 1)
].
After that, I just groupby(df["ColA"])
, and check if any element inside the group is true with transform("any")
.