
I have a dataset with an ID column, a date column, and a value column. I would like to count the consecutive appearances/duplicate values of an ID over a continuous date range.

My question is much like Count consecutive duplicate values by group, but in Python. It also differs from How to find duplicates in pandas dataframe, because my count is based on two columns, one of which (the date) is not identical across rows - it changes, but when it is consecutive I want to count it.

Here is a sample dataset:

ID      tDate            value
79  2019-06-21 00:00:00  397
79  2019-07-13 00:00:00  404
79  2019-07-18 00:00:00  405
79  2019-07-19 00:00:00  406
79  2019-08-02 00:00:00  410
79  2019-08-09 00:00:00  413

I would like the resultant dataset to be:

ID      tDate            val  consec_count
79  2019-06-21 00:00:00  397  0
79  2019-07-13 00:00:00  404  0
79  2019-07-18 00:00:00  405  1
79  2019-07-19 00:00:00  406  2
79  2019-08-02 00:00:00  410  0
79  2019-08-09 00:00:00  413  0

I marked the "singles" with 0 and not 1 because I need to tell the two apart. I will deal with batches of 'duplicates' differently than single records.

Thank you!

sa_zy

2 Answers


Sample:

df = pd.DataFrame({'ID': [79, 79, 79, 79, 79, 79, 80, 80, 80, 80, 80, 80, 80], 
                   'tDate': [pd.Timestamp('2019-07-12 00:00:00'),
                             pd.Timestamp('2019-07-13 00:00:00'),
                             pd.Timestamp('2019-07-18 00:00:00'),
                             pd.Timestamp('2019-07-19 00:00:00'),
                             pd.Timestamp('2019-07-20 00:00:00'),
                             pd.Timestamp('2019-08-03 00:00:00'), 
                             pd.Timestamp('2019-06-21 00:00:00'), 
                             pd.Timestamp('2019-06-22 00:00:00'), 
                             pd.Timestamp('2019-07-18 00:00:00'), 
                             pd.Timestamp('2019-07-19 00:00:00'), 
                             pd.Timestamp('2019-07-26 00:00:00'), 
                             pd.Timestamp('2019-08-02 00:00:00'), 
                             pd.Timestamp('2019-08-03 00:00:00')],
                   'value':[397, 404, 405, 406, 408, 413, 397, 404, 405, 406, 408, 410, 413]})

print (df)
    ID      tDate  value
0   79 2019-07-12    397
1   79 2019-07-13    404
2   79 2019-07-18    405
3   79 2019-07-19    406
4   79 2019-07-20    408
5   79 2019-08-03    413
6   80 2019-06-21    397
7   80 2019-06-22    404
8   80 2019-07-18    405
9   80 2019-07-19    406
10  80 2019-07-26    408
11  80 2019-08-02    410
12  80 2019-08-03    413

Solution:

a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))
s = (~a).cumsum()
df['consec_count'] = np.where(a.groupby(s).transform('any'),
                              df.groupby(s).cumcount().add(1), 0)

print (df)
    ID      tDate  value  consec_count
0   79 2019-07-12    397             1
1   79 2019-07-13    404             2
2   79 2019-07-18    405             1
3   79 2019-07-19    406             2
4   79 2019-07-20    408             3
5   79 2019-08-03    413             0
6   80 2019-06-21    397             1
7   80 2019-06-22    404             2
8   80 2019-07-18    405             1
9   80 2019-07-19    406             2
10  80 2019-07-26    408             0
11  80 2019-08-02    410             1
12  80 2019-08-03    413             2
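
For reuse, the three lines of the solution can be wrapped into a small helper. This is a sketch; the name `consecutive_count` and the column-name parameters are mine, not from the answer:

```python
import numpy as np
import pandas as pd

def consecutive_count(df, id_col='ID', date_col='tDate'):
    # True where a row's date is exactly one day after the previous row of the same ID
    a = df.groupby(id_col)[date_col].diff().eq(pd.Timedelta(1, unit='d'))
    # start a new group every time the one-day chain breaks
    s = (~a).cumsum()
    # number rows within each chain; zero out chains that have no consecutive pair
    return np.where(a.groupby(s).transform('any'),
                    df.groupby(s).cumcount().add(1), 0)
```

Calling it on the sample frame above reproduces the consec_count column shown.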

Explanation:

First, create a mask that compares the per-group date differences (DataFrameGroupBy.diff) against one day:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))))
    ID      tDate  value    diff      a
0   79 2019-07-12    397     NaT  False
1   79 2019-07-13    404  1 days   True
2   79 2019-07-18    405  5 days  False
3   79 2019-07-19    406  1 days   True
4   79 2019-07-20    408  1 days   True
5   79 2019-08-03    413 14 days  False
6   80 2019-06-21    397     NaT  False
7   80 2019-06-22    404  1 days   True
8   80 2019-07-18    405 26 days  False
9   80 2019-07-19    406  1 days   True
10  80 2019-07-26    408  7 days  False
11  80 2019-08-02    410  7 days  False
12  80 2019-08-03    413  1 days   True

Then create unique group identifiers with Series.cumsum on the condition inverted by ~:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum()))

    ID      tDate  value    diff      a  a_neg  s
0   79 2019-07-12    397     NaT  False   True  1
1   79 2019-07-13    404  1 days   True  False  1
2   79 2019-07-18    405  5 days  False   True  2
3   79 2019-07-19    406  1 days   True  False  2
4   79 2019-07-20    408  1 days   True  False  2
5   79 2019-08-03    413 14 days  False   True  3
6   80 2019-06-21    397     NaT  False   True  4
7   80 2019-06-22    404  1 days   True  False  4
8   80 2019-07-18    405 26 days  False   True  5
9   80 2019-07-19    406  1 days   True  False  5
10  80 2019-07-26    408  7 days  False   True  6
11  80 2019-08-02    410  7 days  False   True  7
12  80 2019-08-03    413  1 days   True  False  7

Create a mask with GroupBy.transform and DataFrameGroupBy.any to test whether each group contains at least one True; if it does, all rows of that group are set to True:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum(),
                 mask = a.groupby(s).transform('any')))

    ID      tDate  value  consec_count    diff      a  a_neg  s   mask
0   79 2019-07-12    397             1     NaT  False   True  1   True
1   79 2019-07-13    404             2  1 days   True  False  1   True
2   79 2019-07-18    405             1  5 days  False   True  2   True
3   79 2019-07-19    406             2  1 days   True  False  2   True
4   79 2019-07-20    408             3  1 days   True  False  2   True
5   79 2019-08-03    413             0 14 days  False   True  3  False
6   80 2019-06-21    397             1     NaT  False   True  4   True
7   80 2019-06-22    404             2  1 days   True  False  4   True
8   80 2019-07-18    405             1 26 days  False   True  5   True
9   80 2019-07-19    406             2  1 days   True  False  5   True
10  80 2019-07-26    408             0  7 days  False   True  6  False
11  80 2019-08-02    410             1  7 days  False   True  7   True
12  80 2019-08-03    413             2  1 days   True  False  7   True

Create a counter per group s with GroupBy.cumcount:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum(),
                 mask = a.groupby(s).transform('any'),
                 c = df.groupby(s).cumcount().add(1)))

    ID      tDate  value  consec_count    diff      a  a_neg  s   mask  c
0   79 2019-07-12    397             1     NaT  False   True  1   True  1
1   79 2019-07-13    404             2  1 days   True  False  1   True  2
2   79 2019-07-18    405             1  5 days  False   True  2   True  1
3   79 2019-07-19    406             2  1 days   True  False  2   True  2
4   79 2019-07-20    408             3  1 days   True  False  2   True  3
5   79 2019-08-03    413             0 14 days  False   True  3  False  1
6   80 2019-06-21    397             1     NaT  False   True  4   True  1
7   80 2019-06-22    404             2  1 days   True  False  4   True  2
8   80 2019-07-18    405             1 26 days  False   True  5   True  1
9   80 2019-07-19    406             2  1 days   True  False  5   True  2
10  80 2019-07-26    408             0  7 days  False   True  6  False  1
11  80 2019-08-02    410             1  7 days  False   True  7   True  1
12  80 2019-08-03    413             2  1 days   True  False  7   True  2

And last, set 0 where the mask is False, using numpy.where:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum(),
                 mask = a.groupby(s).transform('any'),
                 c = df.groupby(s).cumcount().add(1),
                 out = np.where(mask, df.groupby(s).cumcount().add(1), 0)))

    ID      tDate  value  consec_count    diff      a  a_neg  s   mask  c  out
0   79 2019-07-12    397             1     NaT  False   True  1   True  1    1
1   79 2019-07-13    404             2  1 days   True  False  1   True  2    2
2   79 2019-07-18    405             1  5 days  False   True  2   True  1    1
3   79 2019-07-19    406             2  1 days   True  False  2   True  2    2
4   79 2019-07-20    408             3  1 days   True  False  2   True  3    3
5   79 2019-08-03    413             0 14 days  False   True  3  False  1    0
6   80 2019-06-21    397             1     NaT  False   True  4   True  1    1
7   80 2019-06-22    404             2  1 days   True  False  4   True  2    2
8   80 2019-07-18    405             1 26 days  False   True  5   True  1    1
9   80 2019-07-19    406             2  1 days   True  False  5   True  2    2
10  80 2019-07-26    408             0  7 days  False   True  6  False  1    0
11  80 2019-08-02    410             1  7 days  False   True  7   True  1    1
12  80 2019-08-03    413             2  1 days   True  False  7   True  2    2
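
As a check, running the same three lines on the frame from the question gives the consec_count the asker expects:

```python
import numpy as np
import pandas as pd

# the asker's original sample data
df = pd.DataFrame({'ID': [79] * 6,
                   'tDate': pd.to_datetime(['2019-06-21', '2019-07-13',
                                            '2019-07-18', '2019-07-19',
                                            '2019-08-02', '2019-08-09']),
                   'value': [397, 404, 405, 406, 410, 413]})

a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))
s = (~a).cumsum()
df['consec_count'] = np.where(a.groupby(s).transform('any'),
                              df.groupby(s).cumcount().add(1), 0)
print(df['consec_count'].tolist())  # [0, 0, 1, 2, 0, 0]
```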
jezrael
  • Thank you very much! This solves it! Would you mind going into some more detail explaining the solution? I don't follow it... – sa_zy Sep 23 '19 at 08:52
  • @sa_zy - For first solution? – jezrael Sep 23 '19 at 08:54
  • 1
    Second solution. I don't follow the groupby((~a).cumsum()) especially after having selected df[a], how are the ~a still there? – sa_zy Sep 23 '19 at 08:55
  • 1
    @sa_zy - Added explantion to answer. – jezrael Sep 23 '19 at 09:02
  • there seems to be a bug with your implementation that doesn't appear in Andy L.'s solution. I think the or-condition in a for date-diff -1 is redundant. – sa_zy Sep 24 '19 at 11:50
  • @sa_zy - Tested again and found my solution was not correct, also another answer return wrong output. So solution was changed, please check it. – jezrael Sep 25 '19 at 08:03

You may also create a mask on a groupby of ID and OR it with its shift(-1) to flag all consecutive rows as True, assigning the result to mask s1. Finally, use np.where on s1 and s1.groupby.cumsum:

s = df.groupby('ID').tDate.diff().eq(pd.Timedelta(days=1))  # exactly one day after the previous row
s1 = s | s.shift(-1, fill_value=False)                      # also flag the first row of each run
df['consec_count'] = np.where(s1, s1.groupby(df.ID).cumsum(), 0)

Out[185]:
   ID      tDate  value  consec_count
0  79 2019-06-21    397             0
1  79 2019-07-13    404             0
2  79 2019-07-18    405             1
3  79 2019-07-19    406             2
4  79 2019-08-02    410             0
5  79 2019-08-09    413             0
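
One caveat worth noting (my own observation, not part of the answer): s1.groupby(df.ID).cumsum() resets per ID rather than per run, so a second consecutive run under the same ID continues the count instead of restarting at 1:

```python
import numpy as np
import pandas as pd

# two separate one-day runs under the same ID (a constructed example)
df = pd.DataFrame({'ID': [79] * 4,
                   'tDate': pd.to_datetime(['2019-07-18', '2019-07-19',
                                            '2019-08-02', '2019-08-03'])})

s = df.groupby('ID').tDate.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
print(np.where(s1, s1.groupby(df.ID).cumsum(), 0))  # [1 2 3 4] -- second run continues at 3
```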
Andy L.