1

I have data that is in the format shown below:

    ID          qi        di            b       start_date   end_date    delta_t
0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      230
1  2445252  304.377500  1.000000  2.000000e+01 2020-07-09 2021-05-06      222
2  3323232   16.047443  0.017908  3.556858e-09 2020-07-10 2021-05-26      221
3  4444242  190.799229  0.162360  2.000000e+01 2020-07-11 2021-05-06      220
4  5555366  153.341044  0.000195  2.730887e-04 2020-07-01 2021-05-26      230
5  6343423  518.195900  0.000073  1.516531e+01 2020-07-12 2021-05-10      219

I would like to have something like the frame shown below where each ID is expanded to contain rows from 0 to delta_t.

     ID          qi        di            b       start_date   end_date    delta_t
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      0
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      1
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      2
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      3
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      4
...
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      230

Any help is appreciated!

lucusm
  • 65
  • 5
  • This [answer](https://stackoverflow.com/questions/49074021/repeat-rows-in-data-frame-n-times) can help you. – minzey Jun 23 '21 at 18:22

3 Answers3

2

One way is to first form a repeated indices of the frame where repetition amount is from delta_t column, and then loc with it. To reset the delta_t to 0..N for each group, we can use cumcount:

# +1 at the end is to include `N` in `0..N`
repeated_inds = df.index.repeat(repeats=df.delta_t + 1)

new_df = df.loc[repeated_inds]

new_df.delta_t = new_df.groupby("delta_t").cumcount()

to get

>>> new_df

         ID          qi        di         b  start_date    end_date  delta_t
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        0
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        1
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        2
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        3
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        4
..      ...         ...       ...       ...         ...         ...      ...
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      215
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      216
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      217
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      218
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      219

[1348 rows x 7 columns]

A sanity check is:

>>> df.delta_t.add(1).sum() == len(new_df)
True
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
2

I know this is a possible solution but there is likely a more pythonic way.

df['delta_t'] = df['delta_t'].transform(lambda x:  list(range(0, x + 1)))
df = df.explode('delta_t')

and then explode this newly created list column which will give the following desired dataframe

    ID          qi        di          b       start_date   end_date delta_t
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       0
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       1
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       2
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       3
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       4
..           ...         ...       ...        ...        ...        ...     ...
5   1232111  518.195900  0.000073  15.165307 2020-07-12 2021-05-10     215
lucusm
  • 65
  • 5
  • 1
    Thank you, I am relatively new to python so it can be a challenge to find the most efficient methods! – lucusm Jun 23 '21 at 18:40
0
extra_delta_ts = []
for row in df.itertuples():
    for i in range(row.delta_t):
        row_data = [row.ID, row.qi, row.di, row.b, row.start_date, row.end_date, i]   
        extra_delta_ts.append(row_data)

columns = ['ID', 'qi', 'di', 'b',  'start_date', 'end_date', 'delta_t']
extra_delta_ts_df = pd.DataFrame(extra_delta_ts, columns=columns)
concat_df = pd.concat([df, extra_delta_ts_df])
concat_df.sort_values(by=['ID', 'delta_t'], inplace=True)
chatax
  • 990
  • 3
  • 17