
This is a follow-on from a previous question:

Grouping data by id, var1 into consecutive dates in python using pandas

I have a data set with ids and values per day. I want to group all consecutive days together into one line; if a day is missing, I would like to start a new line.

This worked for a simple example; however, for the example below the grouping fails and I cannot spot why.

import pandas as pd
import collections

df_raw_dates1 = pd.DataFrame(
    {
        "id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
        "var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18)

        ],
    }
)

day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))

>>>    id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22

3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19

7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19
Adam

2 Answers


Try the following solution:

# create temporary Group ID (to make following steps more clear)
df['gid'] = df['var1'] + df['id'].astype(str)
# sort values
df.sort_values(['gid', 'dates'], inplace=True)
# calculate difference between days
# (there are some differences between different groups, but they are irrelevant)
df['diff'] = df.groupby(['gid'])['dates'].diff().dt.days
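As a quick sanity check, here is a minimal sketch (with made-up dates, not the question's data) of what `diff().dt.days` produces: `NaN` for the first row of a group, and the gap in days otherwise:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-01-18", "2021-01-19", "2021-01-21"]))
gaps = s.diff().dt.days   # NaN for the first row, then the gap in days
print(gaps.tolist())      # [nan, 1.0, 2.0]
```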

# walk through all gid groups to create groups of consecutive dates within gid 
for _, grp in df.groupby('gid'):
    e = 0
    for r in grp.itertuples():
        # when diff > 1 the dates are not consecutive -> increment counter for new group
        if r.diff and r.diff > 1:
            e += 1
        df.loc[r.Index, 'gid'] = r.gid + str(e)
    
# end dates are max values within new gid groups
df['end_dates'] = df.groupby(['gid'])['dates'].transform('max')
# rename dates column
df.rename({'dates': 'start_dates'}, axis=1, inplace=True)
# sum up val column within new gid groups
df['val'] = df.groupby(['gid'])['val'].transform('sum')
# remove all duplicate rows - first row in each gid group contains 
# correct start and end dates, others are irrelevant
df.drop_duplicates(['gid'], keep='first', inplace=True)
# remove all temporary columns (if needed)
df.drop(['gid', 'diff'], axis=1, inplace=True)

df.sort_values(['var1'], inplace=True)
df.reset_index(inplace=True, drop=True)

Output:

id  var1    val start_dates end_dates
0   100 a   0   2021-01-22  2021-01-22
1   100 b   0   2021-01-21  2021-01-22
2   100 d   0   2021-01-22  2021-01-22
3   105 a   13  2021-01-18  2021-01-22
4   105 b   44  2021-01-18  2021-01-22
5   105 c   1   2021-01-18  2021-01-18
6   105 c   1   2021-01-20  2021-01-22
7   105 d   22  2021-01-18  2021-01-22
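As a design note, the per-row Python loop above can also be replaced by a vectorized cumulative sum over the gap mask. A minimal sketch on a made-up (id, var1) slice (not the exact code from either answer):

```python
import pandas as pd

# made-up slice shaped like the question's df_raw_dates1 (id=105, var1='c')
df = pd.DataFrame({
    "id":   [105, 105, 105, 105],
    "var1": ["c", "c", "c", "c"],
    "val":  [1, 0, 1, 0],
    "dates": pd.to_datetime(["2021-01-18", "2021-01-20", "2021-01-21", "2021-01-22"]),
})

df = df.sort_values(["id", "var1", "dates"])
# True whenever the gap to the previous date within (id, var1) is not exactly 1 day
gap = df.groupby(["id", "var1"])["dates"].diff().dt.days.ne(1)
df["block"] = gap.cumsum()

out = (df.groupby(["id", "var1", "block"], as_index=False)
         .agg(val=("val", "sum"), start=("dates", "min"), end=("dates", "max")))
print(out)
```

This gives one row per run of consecutive dates (here: 2021-01-18 alone, then 2021-01-20 through 2021-01-22), without any per-row Python loop.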
Lukas


First of all, the answer to your previous question assumed that the dates were increasing within each (id, var1) pair. You can fix this with the following line:

df_raw_dates1=df_raw_dates1.sort_values(by=['id','var1','dates'])
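To see why the sort matters: on descending dates every difference is -1 day, so `dt.diff() != day` marks every row as a break. A minimal sketch with made-up dates:

```python
import pandas as pd

day = pd.Timedelta("1d")
desc = pd.Series(pd.to_datetime(["2021-01-22", "2021-01-21", "2021-01-20"]))
# descending dates: the diffs are NaT, -1 day, -1 day -> every row looks like a break
print((desc.diff() != day).tolist())   # [True, True, True]

asc = desc.sort_values().reset_index(drop=True)
# ascending dates: only the first row (NaT diff) starts a new group
print((asc.diff() != day).tolist())    # [True, False, False]
```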

Once this step is done, you still have an issue with (id=105, var1=c) because the days are not consecutive. I fixed this by removing the use of in_block and filt. The beginning of your for loop then looks like this:

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    breaks = dt.diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    ...

With these modifications, I get the following output:

     id var1  val      start        end
0  100    a    0 2021-01-22 2021-01-22
1  100    b    0 2021-01-21 2021-01-22
2  100    d    0 2021-01-22 2021-01-22
3  105    a   13 2021-01-18 2021-01-22
4  105    b   44 2021-01-18 2021-01-22
5  105    c    1 2021-01-18 2021-01-18
6  105    c    1 2021-01-20 2021-01-22
7  105    d   22 2021-01-18 2021-01-22
Romain P.