
I have some data that looks like:

df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
                             "val": [9, 2, 4, 7, 6, 3, 2],
                             "dates": [pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 3),
                                       pd.Timestamp(2020, 1, 5),
                                       pd.Timestamp(2020, 3, 12)]})

I want to group this data by id and var1 where the dates are consecutive; if a day is missed, I want to start a new record.

For example the final output should be:

df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104], "var1": ['a', 'b', 'b', 'a', 'c'],
                              "val": [13, 2, 13, 3, 2],
                              "start_date": [pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 2),
                                             pd.Timestamp(2020, 1, 5),
                                             pd.Timestamp(2020, 3, 12)],
                              "end_date": [pd.Timestamp(2020, 1, 2),
                                           pd.Timestamp(2020, 1, 1),
                                           pd.Timestamp(2020, 1, 3),
                                           pd.Timestamp(2020, 1, 5),
                                           pd.Timestamp(2020, 3, 12)]})

I have tried this a few ways and keep failing. The length of time that something can exist for is unknown, and the possible values of var1 can change with each id and with each date window as well.

For example, I have tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). I then thought I could take date(min) and date(min) + count_days to get 'start_date' and 'end_date':

import numpy as np

s = df_raw_dates.groupby(['id', 'var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df_raw_dates['count_days'] = np.where(s1, s1.groupby(df_raw_dates.id).cumsum(), 0)
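
For what it's worth, here is a minimal sketch (my illustration, not the original attempt) of a per-row consecutive-day counter that resets at each gap, assuming the frame is sorted by date within each (id, var1) group:

import pandas as pd

df_sorted = df_raw_dates.sort_values(['id', 'var1', 'dates'])
# a new run starts wherever the day-to-day gap is not exactly one day
run_id = df_sorted.groupby(['id', 'var1'])['dates'].diff().ne(pd.Timedelta(days=1)).cumsum()
# cumcount gives 0, 1, 2, ... within each consecutive run
df_sorted['count_days'] = df_sorted.groupby(['id', 'var1', run_id]).cumcount()

With run_id in hand, start_date and end_date can come from a min/max per run rather than date(min) + count_days.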

I have also tried:

df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'dates': ['first', 'last']}).reset_index()

This gets me closer, but I don't think it deals with the consecutive-days problem; instead it gives the overall earliest and latest day, which unfortunately isn't something I can take forward.
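
Reusing the run_id sketch above (again an illustration, not the original attempt), that agg can be made gap-aware by adding the run id as a third grouping key:

out = (df_sorted.groupby(['id', 'var1', run_id])
                .agg(val=('val', 'sum'),
                     start_date=('dates', 'first'),
                     end_date=('dates', 'last'))
                .reset_index(level=2, drop=True)
                .reset_index())

Since df_sorted is ordered by date, 'first' and 'last' coincide with the min and max of each consecutive window.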

EDIT: adding more context

Another approach is:

df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)

which provides a deduplicated list of ids and dates, but I am getting stuck on finding the min/max consecutive dates within this new window.
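
One hedged way forward from here (an illustration, not from the original post): on that deduplicated frame, subtracting a running day counter from the dates gives a value that stays constant within a run of consecutive days, so it can act as a window id:

ids_dates = (df_raw_dates[['id', 'dates']]
             .drop_duplicates()
             .sort_values(['id', 'dates']))
# within a run of consecutive days, dates - (0, 1, 2, ...) days is constant
offset = pd.to_timedelta(ids_dates.groupby('id').cumcount(), unit='D')
window = (ids_dates['dates'] - offset).rename('window')
out = (ids_dates.groupby(['id', window])['dates']
                .agg(start_date='min', end_date='max')
                .reset_index('window', drop=True)
                .reset_index())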


Here is an extended example that has a break in the date range for group (102, 'a').

df_raw_dates = pd.DataFrame(
    {
        "id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
        "var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
        "val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
        "dates": [
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 5),
            pd.Timestamp(2020, 3, 12),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 7),
            pd.Timestamp(2020, 1, 8),
            pd.Timestamp(2020, 1, 9),
            pd.Timestamp(2020, 1, 21),
            pd.Timestamp(2020, 1, 25),
        ],
    }
)

Further example

This uses the answer below from wwii.

import pandas as pd
import collections

df_raw_dates1 = pd.DataFrame(
    {
        "id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
        "var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18)
        ],
    }
)

day = pd.Timedelta('1d')
# using the further example df_raw_dates1 above
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))

>>>    id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22

3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19

7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19

From the above I would have expected rows 3, 4, 5, 6 to be grouped together, and rows 7, 8, 9, 10 as well. I am not sure why this example now breaks.

Not sure what the difference is between this example and the extended example above, or why this one does not seem to work?

Adam
  • It might help to mention/show your failed attempts and explain why the result didn't work - maybe you were close and one of them just needs a tweak. – wwii Jan 22 '21 at 16:59
  • You are trying to do a number of things - which one in particular are you having the most trouble with and are asking about in this question? For example - `how do I split a series into consecutive date ranges?` - We usually expect one question per question and frown upon *please implement this for me*. Please take the [tour] and read [ask] and the other links found on that page. – wwii Jan 22 '21 at 17:10
  • thanks @wwii, I added some more context to the question; sorry this was not in the correct format, it's only my first time adding something here :) – Adam Jan 22 '21 at 17:23
  • Related:[How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas](https://stackoverflow.com/questions/40118037/how-can-i-detect-gaps-and-consecutive-periods-in-a-time-series-in-pandas), [Detect consecutive dates in pandas series of DatetimeIndex](https://stackoverflow.com/questions/27679665/detect-consecutive-dates-in-pandas-series-of-datetimeindex), – wwii Jan 22 '21 at 17:23
  • @wwii I have added some more context about exactly where I am stuck and what I have tried; there are a few approaches there, and I don't know if this helps or makes it more confusing. I hope it helps, thanks in advance, Adam – Adam Jan 22 '21 at 18:10
  • You really should not continually morph your questions based on answers you receive. If the answers are not sufficient but solve some of your problems, don't accept them but feel free to ask a new question(s) using those answers (remember to include an attribution). We typically ask for one question per question. It seems my answer partially solved your problem but may have created others. Play with my answer and see if you can adapt it - if not ask another question. – wwii Jan 23 '21 at 17:18

1 Answer


I don't have Pandas superpowers so I never try to do groupby one-liners, maybe someday.

Adapting the accepted answer to the SO question Find group of consecutive dates in Pandas DataFrame: first group by ['id', 'var1']; then, within each group, group by consecutive date ranges.

import pandas as pd
sep = "************************************\n"
day = pd.Timedelta('1d')
# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])

for k,g in gb:
    print(g)
    dt = g['dates']
    # mark rows whose neighbor (previous or next) is exactly one day away
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)

    # create a Series to identify consecutive ranges to group by
    # this cumsum trick can be found in many SO answers
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    # split into date ranges
    date_groups = g.groupby(groups)
    for _,date_range in date_groups:
        print(date_range)
    print(sep)

You can see that the (102,'a') group has been split into two groups.

    id var1  val      dates
0  102    a    9 2020-01-01
2  102    a    4 2020-01-02
7  102    a    1 2020-01-03
     id var1  val      dates
8   102    a    2 2020-01-07
9   102    a    3 2020-01-08
10  102    a    4 2020-01-09

Going a bit further: while iterating, build up a dictionary to construct a new DataFrame from.

import pandas as pd
import collections
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
    # print(g)
    eyed,var = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(g.val.mean())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(),date_range['dates'].max()
        val = date_range.val.mean()
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)


print(pd.DataFrame(new_df))

>>>
    id var1        val      start        end
0  102    a   4.666667 2020-01-01 2020-01-03
1  102    a   3.000000 2020-01-07 2020-01-09
2  102    b   2.000000 2020-01-01 2020-01-01
3  103    a   3.000000 2020-01-05 2020-01-05
4  103    b   6.500000 2020-01-02 2020-01-03
5  104    c   2.000000 2020-03-12 2020-03-12
6  108    a  99.000000 2020-01-21 2020-01-25

Seems pretty tedious; maybe someone will come along with a less verbose solution. Some of the operations could perhaps be put in functions, and .apply, .transform, or .pipe could be used to make it a little cleaner.
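
As an illustration of that idea (a hedged sketch, not part of the original answer), the per-group work can be moved into a function and driven by groupby().apply:

import pandas as pd

day = pd.Timedelta('1d')

def collapse_runs(g):
    # run id increments at every gap larger than one day
    runs = g['dates'].diff().ne(day).cumsum()
    return g.groupby(runs).agg(
        val=('val', 'mean'), start=('dates', 'min'), end=('dates', 'max'))

result = (df_raw_dates.sort_values('dates')
                      .groupby(['id', 'var1'])
                      .apply(collapse_runs)
                      .reset_index(level=2, drop=True)
                      .reset_index())

Because the run id is computed on the whole group rather than a filtered subset, single-date runs fall out naturally here.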


It does not correctly handle ('id', 'var1') groups that have more than one date but no consecutive dates at all; those get collapsed into a single range, e.g.

     id var1  val      dates
11  108    a   99 2020-01-21
12  108    a   99 2020-01-25

You might need to detect whether there are any gaps in a datetime Series and use that fact to accommodate such groups.
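
A hedged helper along those lines (hypothetical, not from the original answer):

import pandas as pd

def has_gaps(dates, day=pd.Timedelta('1d')):
    # True if the dates, once sorted, skip at least one calendar day
    d = dates.sort_values()
    return bool(d.diff().dropna().ne(day).any())

For the extended example, has_gaps(df_raw_dates.loc[df_raw_dates['id'] == 108, 'dates']) would return True, since 2020-01-21 and 2020-01-25 are four days apart.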

wwii
  • Not off the top of my head. Fifty thousand rows doesn't sound like a lot. Multiple processes might help if you had to do this for multiple DataFrames. I'd have to study it to find efficiencies; I really wasn't focused on that. With Pandas you gain speed by operating on whole things - I solved your problem by finding groups within groups, which never *sounds good*, but I was just focused on finding a solution. When I decided to just build a dictionary first I kinda didn't like that, so maybe there is something there. Also, I just picked the first viable solution for finding date ranges. – wwii Jan 23 '21 at 15:30
  • thanks, no I didn't think it was a lot either but it felt pretty slow. I have also noticed that something is going awry with the example I just added into the Q, but I can't spot why, when the 'extended' example that you added works in all cases but the listed example. I thought this would be due to val being 0, but this doesn't affect the solution – Adam Jan 23 '21 at 16:36