
Essentially I have data which provides a start time, the number of time slots and the duration of each slot. I want to convert that into a dataframe of start and end times, which I've achieved, but I can't help but think it is not efficient or particularly pythonic. The real data has multiple IDs, hence the grouping.

import pandas as pd

slots = pd.DataFrame({"ID": 1, "StartDate": pd.to_datetime("2019-01-01 10:30:00"), "Quantity": 3, "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])
grp_data = slots.groupby("ID")

bob = []

for rota_id, row in grp_data:
    start = row.iloc[0, 1]  # StartDate column
    delta = row.iloc[0, 3]  # Duration column
    for quantity in range(1, int(row.iloc[0, 2] + 1)):  # 1..Quantity
        data = {"RotaID":    rota_id,
                "DateStart": start,
                "Duration":  delta,
                "DateEnd":   start+delta}

        bob.append(data)
        start = start + delta

fred = pd.DataFrame(bob)

This might be answered elsewhere but I've no idea how to properly search this since I'm not sure what my problem is.

EDIT: I've updated my code to be more efficient with its function calls and it is faster, but I'm still interested in knowing whether there is a vectorised approach to this.

  • Do you have a more concrete / less opinion-centric question than one that corresponds around "pythonic" code? See [Are Pythonic questions opinion-based?](https://meta.stackexchange.com/questions/191671/are-pythonic-questions-opinion-based) on [meta.se], holding that a question is valid *as long as there's still substance there with the "pythonic" criteria removed*. I've made a guess at what that remaining substance would be and tried to edit towards that end; your review of that edit would be appreciated. – Charles Duffy Jan 23 '20 at 16:08
  • It looks like you're doing a lot of calls to `append`. From what I've seen (and what [others suggest](https://stackoverflow.com/a/15822811/4739755)), `concat` is almost equivalent to `append`, but would allow you do build up a list of dataframes, and then concatenate them all together in a single call. – b_c Jan 23 '20 at 16:10
  • @CharlesDuffy Thanks for the edit, this is exactly what I wanted to ask. Thank you. Pythonic was a poor way to phrase this – PatientSnake Jan 23 '20 at 16:16
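The list-of-frames idea from the comments can be sketched as follows. This is a hedged illustration, not the accepted answer's method: it still loops over groups, but builds each group's rows in one go and calls `pd.concat` once at the end rather than appending row dicts inside the inner loop.

```python
import pandas as pd

slots = pd.DataFrame({"ID": 1,
                      "StartDate": pd.to_datetime("2019-01-01 10:30:00"),
                      "Quantity": 3,
                      "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])

frames = []
for rota_id, row in slots.groupby("ID"):
    start = row["StartDate"].iloc[0]
    delta = row["Duration"].iloc[0]
    n = int(row["Quantity"].iloc[0])
    # One small frame per group: starts are offset by whole multiples of delta
    starts = pd.Series([start + i * delta for i in range(n)])
    frames.append(pd.DataFrame({"RotaID": rota_id,
                                "DateStart": starts,
                                "Duration": delta,
                                "DateEnd": starts + delta}))

# Single concatenation instead of many appends
fred = pd.concat(frames, ignore_index=True)
```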

2 Answers


How about this way:

import numpy as np

# Repeat row i 'Quantity[i]' times; .copy() avoids chained-assignment warnings
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup), :].copy()

# Add a counter per ID; used to 'shift' the duration along StartDate
slots_ext['counter'] = slots_ext.groupby('ID').cumcount()

# Calculate DateStart and DateEnd based on counter and Duration
slots_ext['DateStart'] = (slots_ext.counter) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext['DateEnd'] = (slots_ext.counter + 1) * slots_ext.Duration.values + slots_ext.StartDate

slots_ext.loc[:, ['ID', 'DateStart', 'Duration', 'DateEnd']].reset_index(drop=True)
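As a side note on the row-replication step, pandas' built-in `Index.repeat` can replace the `np.repeat`/`enumerate` list comprehension. A minimal sketch of the same technique, assuming the single-row `slots` frame from the question:

```python
import pandas as pd

slots = pd.DataFrame({"ID": 1,
                      "StartDate": pd.to_datetime("2019-01-01 10:30:00"),
                      "Quantity": 3,
                      "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])

# Repeat each row 'Quantity' times in one call, no numpy needed
slots_ext = slots.loc[slots.index.repeat(slots["Quantity"])].copy()

# Counter per ID shifts each slot along from StartDate
slots_ext["counter"] = slots_ext.groupby("ID").cumcount()
slots_ext["DateStart"] = slots_ext["StartDate"] + slots_ext["counter"] * slots_ext["Duration"]
slots_ext["DateEnd"] = slots_ext["DateStart"] + slots_ext["Duration"]

result = slots_ext[["ID", "DateStart", "Duration", "DateEnd"]].reset_index(drop=True)
```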

Performance
Looking at performance on a larger dataframe (duplicated 1000 times) using

slots_large = pd.concat([slots] * 1000, ignore_index=True).drop('ID', axis=1).reset_index().rename(columns={'index': 'ID'})

Yields:
Old method: 289 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
New method: 8.13 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

RubenB
  • This is a very interesting way to look at it, and although I have yet to test it myself - based on your benchmarks this looks to be much faster. Thank you! – PatientSnake Jan 25 '20 at 21:59

In case this ever helps anyone: I found that my data set had varying deltas per ID and @RubenB's initial answer doesn't handle those. Here was my final solution based on his/her code:

# RubenB's code
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]

# Calculate the cumulative sum of the delta per rota ID
# (the cumsum of a timedelta column is already a timedelta, so no conversion needed)
slots_ext["delta_sum"] = slots_ext.groupby("ID")["Duration"].cumsum()

# Use the cumulative sum to calculate the running end dates, then derive each
# start date by subtracting that slot's own duration (this correctly handles
# the first row of every ID, which a plain shift(1) does not)
slots_ext["EndDate"] = slots_ext["StartDate"] + slots_ext["delta_sum"]
slots_ext["StartDate"] = slots_ext["EndDate"] - slots_ext["Duration"]
slots_ext.reset_index(drop=True, inplace=True)
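As a quick end-to-end check of the cumulative-sum idea with the varying-duration case this answer targets (the two-ID data below is hypothetical, made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two IDs with different slot durations
slots = pd.DataFrame({
    "ID": [1, 2],
    "StartDate": pd.to_datetime(["2019-01-01 10:30:00", "2019-01-01 09:00:00"]),
    "Quantity": [2, 2],
    "Duration": pd.to_timedelta([30, 45], unit="minutes"),
})

# Repeat each row 'Quantity' times
slots_ext = slots.loc[np.repeat(slots.index, slots["Quantity"])].copy()

# Running end date per ID via the cumulative duration, start date by subtraction
slots_ext["delta_sum"] = slots_ext.groupby("ID")["Duration"].cumsum()
slots_ext["EndDate"] = slots_ext["StartDate"] + slots_ext["delta_sum"]
slots_ext["StartDate"] = slots_ext["EndDate"] - slots_ext["Duration"]
slots_ext = slots_ext.reset_index(drop=True)
```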