
I have the following dataframe:

import pandas as pd

dt = pd.DataFrame({'start_date': ['2019-05-20', '2019-05-21', '2019-05-21'],
                   'end_date': ['2019-05-23', '2019-05-24', '2019-05-22'],
                   'reg': ['A', 'B', 'A'],
                   'measure': [100, 200, 1000]})

I would like to create a new column, called 'date', which will have values from start_date until end_date, and also a new column measure_daily, which will be the measure spread equally among these dates.

So basically, I would like to expand dt in terms of rows. The final df would look like:

dt_f = pd.DataFrame({'date': ['2019-05-20', '2019-05-21', '2019-05-22', '2019-05-23',
                              '2019-05-21', '2019-05-22', '2019-05-23', '2019-05-24',
                              '2019-05-21', '2019-05-22'],
                     'reg': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A'],
                     'measure_daily': [25, 25, 25, 25, 50, 50, 50, 50, 500, 500]})

Is there an efficient way to do this in Python?


1 Answer


TL;DR

just give me the solution:

# Tag each original row with a key so duplicate reg values stay separate
dt = dt.assign(key=dt.index)

# Melt start_date and end_date into a single 'date' column
melt = dt.melt(id_vars=['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])  # resample needs datetimes

# Resample each (reg, key) group to daily frequency and forward-fill
melt = pd.concat(
    [d.set_index('date').resample('d').first().ffill()
     for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()

# Divide measure by each group's number of days, then drop the helper key
melt.assign(measure=melt['measure'].div(melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)

Breakdown:

First we tag each row with a key (its index), then melt your start and end dates into a single column:

dt = dt.assign(key=dt.index)
melt = dt.melt(id_vars=['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])

  reg  measure  key       date
0   A      100    0 2019-05-20
1   B      200    1 2019-05-21
2   A     1000    2 2019-05-21
3   A      100    0 2019-05-23
4   B      200    1 2019-05-24
5   A     1000    2 2019-05-22

Then we resample on a daily basis, grouping on ['reg', 'key'] so that each original row stays in its own group:

melt = pd.concat(
    [d.set_index('date').resample('d').first().ffill() for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()

        date reg  measure  key
0 2019-05-20   A    100.0  0.0
1 2019-05-21   A    100.0  0.0
2 2019-05-22   A    100.0  0.0
3 2019-05-23   A    100.0  0.0
4 2019-05-21   B    200.0  1.0
5 2019-05-22   B    200.0  1.0
6 2019-05-23   B    200.0  1.0
7 2019-05-24   B    200.0  1.0
8 2019-05-21   A   1000.0  2.0
9 2019-05-22   A   1000.0  2.0

Finally, we spread the measure column over the size of each group with assign and drop the helper key:

melt.assign(measure=melt['measure'].div(melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)

        date reg  measure
0 2019-05-20   A     25.0
1 2019-05-21   A     25.0
2 2019-05-22   A     25.0
3 2019-05-23   A     25.0
4 2019-05-21   B     50.0
5 2019-05-22   B     50.0
6 2019-05-23   B     50.0
7 2019-05-24   B     50.0
8 2019-05-21   A    500.0
9 2019-05-22   A    500.0
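
For completeness, a shorter alternative sketch that also keeps overlapping date ranges within the same reg separate (each original row keeps its own index): build a per-row date range with pd.date_range, explode it into one row per day, and divide measure by the number of days. This assumes pandas >= 0.25, which introduced DataFrame.explode, and starts from the original dt without the helper key column.

# Alternative sketch; assumes pandas >= 0.25 for DataFrame.explode
dt['date'] = [pd.date_range(s, e, freq='d')
              for s, e in zip(dt['start_date'], dt['end_date'])]
out = dt.explode('date')

# Each original row keeps its index, so the group size is that row's day count
out['measure_daily'] = out['measure'] / out.groupby(level=0)['date'].transform('size')
out = out[['date', 'reg', 'measure_daily']].reset_index(drop=True)

Because the split is done per original row, two rows for the same reg with overlapping date ranges simply produce separate daily rows.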
  • Do you think there is a way to extend this solution in case you had overlapping dates within the same region? – quant Jul 26 '19 at 11:30
  • What would overlapping dates look like within the same region? You mean that somewhere further down the dataset `region A` will show up again with overlapping dates? @quant – Erfan Jul 26 '19 at 11:33
  • Yes, exactly. @killswitch's solution, I think, does that – quant Jul 26 '19 at 11:35
  • Last question: does it happen that a new `reg` further down the dataframe has the same `measure` value as the same reg further up the dataframe? @quant – Erfan Jul 26 '19 at 11:47
  • Last edit: this should give the correct output and should run fast on a big dataframe – Erfan Jul 26 '19 at 11:54