This is a follow on from a previous question:
Grouping data by id, var1 into consecutive dates in python using pandas
I have a data set with ids and values per day, I want to group all consecutive days together to make one line, if a day is missing I would like to start a new line.
This has worked for a simple example however for the example below the grouping fails and I am failing to spot why
import pandas as pd
import collections
df_raw_dates1 = pd.DataFrame(
{
"id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
"var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
"val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
"dates": [
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18)
],
}
)
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
# print(g)
eyed, var1 = k
dt = g['dates']
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
filt = g.loc[in_block]
breaks = filt['dates'].diff() != day
groups = breaks.cumsum()
date_groups = g.groupby(groups)
# print(k,groups,groups.any())
# accomodate groups with only one date
if not groups.any():
new_df['id'].append(eyed)
new_df['var1'].append(var1)
new_df['val'].append(g.val.sum())
new_df['start'].append(g.dates.min())
new_df['end'].append(g.dates.max())
continue
for _,date_range in date_groups:
start,end = date_range['dates'].min(), date_range['dates'].max()
val = date_range.val.sum()
new_df['id'].append(eyed)
new_df['var1'].append(var1)
new_df['val'].append(val)
new_df['start'].append(start)
new_df['end'].append(end)
print(pd.DataFrame(new_df))
>>> id var1 val start end
0 100 a 0.0 2021-01-22 2021-01-22
1 100 b 0.0 2021-01-22 2021-01-22
2 100 d 0.0 2021-01-22 2021-01-22
3 105 a 0.0 2021-01-22 2021-01-22
4 105 a 1.0 2021-01-21 2021-01-21
5 105 a 0.0 2021-01-20 2021-01-20
6 105 a 10.0 2021-01-19 2021-01-19
7 105 b 2.0 2021-01-22 2021-01-22
8 105 b 9.0 2021-01-21 2021-01-21
9 105 b 5.0 2021-01-20 2021-01-20
10 105 b 13.0 2021-01-19 2021-01-19