-1

I have a dataframe:

  ColA  ColB

0   A    1/2/2020 
1   A    1/3/2020       
2   A    1/4/2020
3   A    1/10/2020 
4   B    1/3/2020
5   B    1/19/2020
6   C    1/2/2020
7   C    1/7/2020 
8   D    1/8/2020

Now I want to find out the name of the series in colA which has three consecutive days in colB.

Output:

the answer would be A since it has 1/2/2020, 1/3/2020 and 1/4/2020 in colB.

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
ubuntu_noob
  • 2,305
  • 6
  • 23
  • 64

1 Answers1

0

A general approach would be like this:

# 1. To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])

# 2. Standardize the dates by offseting them
df["ColB_std"] = df["ColB"] - pd.to_timedelta(range(df.shape[0]), 'day')

# 3. Counting each instance of ColA and standardized date
s = df.groupby(["ColA", "ColB_std"])["ColB_std"].count()

# 4. Getting elements from ColA that have at least 1 sequence of at least length 3
colA = s[ s >= 3 ].index.get_level_values(0).unique().values

# 5. Filtering the dataframe
df[ df["ColA"].isin(colA) ]

You want ColAs with 3 consecutive dates. Or you could think of it like you want ColAs where there's a sequence of date, date + 1 day and date + 2 days. By sorting the dataframe by ColA and ColB (1), we know that in the case you want to check, date + 1 day will always follow date, and date + 2 days will be the following that one.

With this, you can standardize the dates by removing n days corresponding to their row. So, the sequence of date, date + 1 day and date + 2 days, becomes date, date and date (2).

Now that we have the date column standardized, we just need to count how many elements each pair ('ColA', 'ColB_std') exist (3), get the elements from ColA that have counts of 3 or more (4), and filter the dataframe (5).

However, this doesn't support duplicated pairs of ('ColA', 'ColB'), for that you'd need to do this first:

df2 = df.drop_duplicates(["ColA", "ColB"])

Proceding to use this df2 in steps 1, 2, 3 and 4, and in the end filtering the real df in step 5.

Previously, I answered that you also could do it like this:

# To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])

# Calculating the difference between dates inside each group
s = df.groupby("ColA")["ColB"].diff().dt.days

# Filtering the dataframe
df[ ((s == 1) & (s.shift(1) == 1)).groupby(df["ColA"]).transform("any") ]

The idea is that in s, the difference is always between the previous date, and the current date. However, this doesn't make sure that there are 3 consecutive dates, just 2. By shifting the series by 1, you are make sure that the current different and the previous one are 1 [ (s == 1) & (s.shift(1) == 1) ].

After that, I just groupby(df["ColA"]), and check if any element inside the group is true with transform("any").

MkWTF
  • 1,372
  • 7
  • 11
  • I am getting this error-ufunc subtract cannot use operands with types dtype(' – ubuntu_noob Jan 29 '20 at 16:12
  • Can you validate that when you print `df.dtypes`, `ColB` is of type `datetime64[ns]`? If this is the case, can you provide me an example of your error in a comment or in an online IDE like [repl.it](https://repl.it/) ? – MkWTF Jan 29 '20 at 16:51
  • Can you explain the 3rd step? – ubuntu_noob Feb 05 '20 at 09:31
  • Here is a image, it might help besides the explanation in the answer: https://imgur.com/BefUOub – MkWTF Feb 08 '20 at 00:07