1

Imagine there is a dataframe:

   id        date  balance_total  transaction_total
0   1  01/01/2019          102.0               -1.0
1   1  01/02/2019          100.0               -2.0
2   1  01/03/2019          100.0                NaN
3   1  01/04/2019          100.0                NaN
4   1  01/05/2019           96.0               -4.0
5   2  01/01/2019          200.0               -2.0
6   2  01/02/2019          100.0               -2.0
7   2  01/04/2019          100.0                NaN
8   2  01/05/2019           96.0               -4.0

here is the create dataframe command:

import pandas as pd
import numpy as np

users=pd.DataFrame(
                [
                {'id':1,'date':'01/01/2019', 'transaction_total':-1, 'balance_total':102},
                {'id':1,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
                {'id':1,'date':'01/03/2019', 'transaction_total':np.nan, 'balance_total':100},
                {'id':1,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
                {'id':1,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':np.nan},
                {'id':2,'date':'01/01/2019', 'transaction_total':-2, 'balance_total':200},
                {'id':2,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
                {'id':2,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
                {'id':2,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':96}  
                ]
                )

How could I check if each id has consecutive dates or not? I use the "shift" idea here but it doesn't seem to work:

Calculating time difference between two rows

df['index_col'] = df.index

for id in df['id'].unique():

    # create an empty QA dataframe

    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns = column_names)

    df_qa['Delta']=(df['index_col'] - df['index_col'].shift(1))

    if (df_qa['Delta'].iloc[1:] != 1).any() is True:

        print('id ' + id +' might have non-consecutive dates')

        # doesn't print any account => Each Customer's Daily Balance has Consecutive Dates
    break

Ideal output:

it should print id 2 might have non-consecutive dates

Thank you!

Chubaka
  • 2,933
  • 7
  • 43
  • 58

3 Answers3

2

Use groupby and diff:

df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")

df["difference"] = df.groupby("id")["date"].diff()

print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])

#
   id       date  transaction_total  balance_total difference
7   2 2019-01-04                NaN          100.0     2 days
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
1

Use DataFrameGroupBy.diff with Series.dt.days, compre by greatee like 1 and filter only id column by DataFrame.loc:

users['date'] = pd.to_datetime(users['date'])

i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]

for val in i:
    print( f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

First step is to parse date:

users['date'] = pd.to_datetime(users.date).

Then add a shifted column on the id and date columns:

users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)

The difference between date and date_shifted columns is of interest:

>>> users.date - users.date_shifted

0       NaT
1    1 days
2    1 days
3    1 days
4    1 days
5   -4 days
6    1 days
7    2 days
8    1 days
dtype: timedelta64[ns]

You can now query the DataFrame for what you want:

users[(users.id_shifted == users.id) & (users.date_shifted - users.date != np.timedelta64(days=1))]

That is, consecutive lines of the same user with a date difference != 1 day.

This solution does assume the data is sorted by (id, date).

Freek Wiekmeijer
  • 4,556
  • 30
  • 37
  • Thank you! I got an error of ```ValueError: cannot set a Timedelta with a non-timedelta``` in the last line ```users[(users.id_shifted == users.id) & (users.date_shifted - users.date != np.timedelta64(days=1))]``` still trying to figure it out – Chubaka Feb 26 '20 at 05:54
  • i see what happened. the real id here is uuid instead of integers. Any thoughts? @Freek Wiekmeijer – Chubaka Feb 26 '20 at 06:03
  • We only do two things with the `id` column: shift (which is insensitive to the dtype) and compare `id == id_shifted` (which is also dtype insensitive problem because the two columns have the same dtype) – Freek Wiekmeijer Feb 26 '20 at 09:58