
I have a dataframe:

df1 = pd.DataFrame(
    [['2011-01-01','2011-01-03','A'], ['2011-04-01','2011-04-01','A'], ['2012-08-28','2012-08-30','B'], ['2015-04-03','2015-04-05','A'], ['2015-08-21','2015-08-21','B']],
    columns=['d0', 'd1', 'event'])
           d0          d1 event
0  2011-01-01  2011-01-03     A
1  2011-04-01  2011-04-01     A
2  2012-08-28  2012-08-30     B
3  2015-04-03  2015-04-05     A
4  2015-08-21  2015-08-21     B

It contains events A and B that occurred over the interval from d0 to d1 inclusive. (There are actually more events; they are interleaved, but their date ranges don't overlap.) An interval can also be a single day (d0 = d1). I need to go from df1 to df2, in which these time intervals are "unrolled" for each event, i.e.:

df2 = pd.DataFrame(
    [['2011-01-01','A'], ['2011-01-02','A'], ['2011-01-03','A'], ['2011-04-01','A'], ['2012-08-28','B'], ['2012-08-29','B'], ['2012-08-30','B'], ['2015-04-03','A'], ['2015-04-04','A'], ['2015-04-05','A'], ['2015-08-21','B']],
    columns=['Date', 'event'])
          Date event
0   2011-01-01     A
1   2011-01-02     A
2   2011-01-03     A
3   2011-04-01     A
4   2012-08-28     B
5   2012-08-29     B
6   2012-08-30     B
7   2015-04-03     A
8   2015-04-04     A
9   2015-04-05     A
10  2015-08-21     B

I tried doing this with resample and comparing areas where ffill == bfill, but couldn't come up with anything. What is the simplest way to do this?

wjandrea
Антон

3 Answers


We can `set_index` to `event`, then create a `date_range` per row, then `explode` to unwind the ranges, and `reset_index` to create the DataFrame:

df2 = (
    df1.set_index('event')
        .apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1)
        .explode()
        .reset_index(name='Date')[['Date', 'event']]
)

df2:

         Date event
0  2011-01-01     A
1  2011-01-02     A
2  2011-01-03     A
3  2011-04-01     A
4  2012-08-28     B
5  2012-08-29     B
6  2012-08-30     B
7  2015-04-03     A
8  2015-04-04     A
9  2015-04-05     A
10 2015-08-21     B
Henry Ecker
  • Do you use `set_index` just to "push aside" the other column temporarily? – creanion Oct 31 '21 at 13:43
  • `explode` will duplicate whatever values are in the index. So it's in there so when the rows are expanded we don't lose the `event` information associations. – Henry Ecker Oct 31 '21 at 13:44
  • 1
    Explode can also duplicate column values. Bringing the event through the `apply` by setting it as index, seems like a nice trick! But in big dataframe (many columns), it looks like there's a different way to do it too (apply and set as an additional column, explode from there, then clean up). I learned something from your answer. Thanks – creanion Oct 31 '21 at 14:21
  • Yes. `DataFrame.explode` can be used to scale up an entire DataFrame. But as you mention that would require (in this case) dropping 2 columns and adding one. Which can all be avoided just by using `Series.explode` after `set_index` here. – Henry Ecker Oct 31 '21 at 14:29
  • FWIW, [here's a variation](https://gist.github.com/wjandrea/4fb05d2a978cf31dc77f9f11f2b5cc6e) that seems more intuitive to me: create the `Date` column first and explode it, then concat with `event`. You could explode after, but this saves writing the column name again. I don't know if this has any specific pros and cons cause I'm not very deep into Pandas. Also I did `*r[['d0', 'd1']]` instead of `r['d0'], r['d1']`. – wjandrea Nov 07 '21 at 20:46
  • 1
    @wjandrea So, `r[['d0', 'd1']]` returns a `Series`, which is indexed. There's a significant amount of overhead to build, align, and copy a `Series` _then_ unpack it that is simply not present in selecting individual values from a `Series`, which returns scalars. This adds a non-negligible amount of overhead to an already slow operation. Unpacking is more reasonable in the pure-Python context than the Pandas-object context for this reason. Having said that, however, the `concat` portion is definitely a better option in the case that you had lots of columns to scale up. – Henry Ecker Nov 07 '21 at 22:38
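For reference, the whole-DataFrame variant discussed in these comments might look like the following sketch (a plausible reading of the comments, not code from the answer): assign the date range as an extra column, explode the entire frame with `DataFrame.explode`, then drop the original interval columns.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['2011-01-01', '2011-01-03', 'A'], ['2011-04-01', '2011-04-01', 'A'],
     ['2012-08-28', '2012-08-30', 'B'], ['2015-04-03', '2015-04-05', 'A'],
     ['2015-08-21', '2015-08-21', 'B']],
    columns=['d0', 'd1', 'event'])

# Build the per-row date range as an extra column, explode the whole
# frame, then drop the now-redundant interval columns.
df2 = (
    df1.assign(Date=df1.apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1))
       .explode('Date')
       .drop(columns=['d0', 'd1'])
       .reset_index(drop=True)[['Date', 'event']]
)
```

As Henry notes above, this costs an extra `drop` compared to the `set_index` + `Series.explode` approach, but it generalizes better when there are many columns to carry through.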

Let's try a generator expression to create the pairs of date and event:

pd.DataFrame(((d, c) for (*v, c) in df1.to_numpy()
              for d in pd.date_range(*v)), columns=['Date', 'Event'])

         Date Event
0  2011-01-01     A
1  2011-01-02     A
2  2011-01-03     A
3  2011-04-01     A
4  2012-08-28     B
5  2012-08-29     B
6  2012-08-30     B
7  2015-04-03     A
8  2015-04-04     A
9  2015-04-05     A
10 2015-08-21     B
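One detail worth confirming for the single-day case the question mentions (d0 = d1): `pd.date_range` with equal endpoints returns exactly one date, so one-day intervals unroll to a single row.

```python
import pandas as pd

# A one-day interval (d0 == d1) yields exactly one date,
# so single-day events produce exactly one output row.
rng = pd.date_range('2015-08-21', '2015-08-21')
print(len(rng))  # 1
```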
Shubham Sharma

I don't know if this is the "most simple" way, but it's the most intuitive one I can think of. I iterate over the rows and unroll each interval manually into a new dataframe: for each row, I iterate over the dates between d0 and d1, construct a row for each date, and compile them all into a dataframe:

import pandas as pd
from datetime import timedelta

def unroll_events(df):
    rows = []
    for _, row in df.iterrows():
        event = row['event']
        start = row['d0']
        end = row['d1']
        current = start
        while current != end:
            rows.append(dict(Date=current, event=event))
            current += timedelta(days=1)
        rows.append(dict(Date=current, event=event))  # make sure the last day is included
    return pd.DataFrame(rows)
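A standalone usage sketch (the function is repeated here so the snippet runs on its own): note that `d0`/`d1` must be parsed to datetimes first, since the loop adds a `timedelta` to them, which would fail on the raw strings.

```python
import pandas as pd
from datetime import timedelta

def unroll_events(df):
    rows = []
    for _, row in df.iterrows():
        current, end = row['d0'], row['d1']
        while current != end:
            rows.append(dict(Date=current, event=row['event']))
            current += timedelta(days=1)
        rows.append(dict(Date=current, event=row['event']))  # include the last day
    return pd.DataFrame(rows)

df1 = pd.DataFrame(
    [['2011-01-01', '2011-01-03', 'A'], ['2015-08-21', '2015-08-21', 'B']],
    columns=['d0', 'd1', 'event'])
# Convert the string columns to Timestamps before unrolling.
df1[['d0', 'd1']] = df1[['d0', 'd1']].apply(pd.to_datetime)

df2 = unroll_events(df1)
# 3 rows for the A interval, 1 for the single-day B interval
```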
dsillman2000
  • 1
    Thanks! Your solution is very good! `current = start.copy()` doesn't work with Timestamps, so I deleted the `.copy()` and it all worked. Thanks a lot! – Антон Oct 31 '21 at 13:28