
I have a df such as

ID  | Half Hour Bucket | clock in time  | clock out time  | Rate
232 | 4/1/19 8:00 PM   | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54
342 | 4/1/19 8:30 PM   | 4/1/19 7:12 PM | 4/1/19 7:22 PM  | 0.23
232 | 4/1/19 7:00 PM   | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54

I want my output to be

 ID | Half Hour Bucket | clock in time  | clock out time  | Rate | Mins
232 | 4/1/19 8:00 PM   | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |
342 | 4/1/19 8:30 PM   | 4/1/19 7:12 PM | 4/1/19 7:22 PM  | 0.23 |
232 | 4/1/19 7:00 PM   | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |

Where Mins represents the difference between clock out time and clock in time.

But each row's minutes value should only cover the half hour bucket that the row corresponds to.

For example, for ID 342 the difference is ten minutes, and those 10 mins would sit on that row.

But for ID 232 the clock in to clock out time spans over three hours. I would only want the 30 mins for 8:00 to 8:30 in the first row and the 18 mins (7:12 to 7:30) in the third row. For the half hour buckets like 8:30-9:00 or 9:00-9:30 that don't exist in the first row, I would want to create new rows in the same df that contain NaNs for everything except the half hour bucket and mins fields.

So the 30 mins from 8:00-8:30 would stay in the first row, but I would want 5 new rows for all the half hour buckets that aren't 4/1/19 8:00 PM, with only the half hour bucket and the rate carrying over from the original row. Is this possible?

I thank anyone for their time!

Chris90

1 Answer


Realised my first answer probably wasn't what you wanted. This version, hopefully, is. It was a bit more involved than I first assumed!

Create Data

First of all, create a dataframe to work with, based on the data supplied in the question. The resulting formatting isn't quite the same, but that is easily fixed, so I've left it as-is here.

import math
import pandas as pd

# Create a dataframe to work with from the data provided in the question
columns = ['id', 'half_hour_bucket', 'clock_in_time', 'clock_out_time' , 'rate']

data = [[232, '4/1/19 8:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54],
        [342, '4/1/19 8:30 PM', '4/1/19 7:12 PM', '4/1/19 7:22 PM', 0.23],
        [232, '4/1/19 7:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54]]

df = pd.DataFrame(data, columns=columns)

def convert_cols_to_dt(df):
    # Convert relevant columns to datetime format
    for col in df:
        if col not in ['id', 'rate']:
            df[col] = pd.to_datetime(df[col])

    return df

df = convert_cols_to_dt(df)
# Create the mins column
df['mins'] = (df.clock_out_time - df.clock_in_time)

Output:

  id  half_hour_bucket    clock_in_time       clock_out_time      rate mins
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000
1 342 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 0 days 00:10:00.000000000
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000

Solution

Next, define a simple function that returns a list whose length equals the number of 30-minute intervals spanned by a mins value.

def upsample_list(x):
    # Count how many half-hour buckets the duration touches,
    # rounding up so that a partial bucket still gets a row
    multiplier = math.ceil(x.total_seconds() / (60 * 30))

    return list(range(multiplier))
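
A quick check against the two durations in the sample data (3 h 33 m spans eight half-hour buckets, while ten minutes fits in one):

upsample_list(pd.Timedelta(hours=3, minutes=33))  # [0, 1, 2, 3, 4, 5, 6, 7]
upsample_list(pd.Timedelta(minutes=10))           # [0]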

And apply this to the dataframe:

df['samples'] = df.mins.apply(upsample_list)

Next, create a new row for each list item in the 'samples' column (using the answer provided by Roman Pekar here):

s = df.apply(lambda x: pd.Series(x['samples']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'sample'

Join s to the dataframe and clean up the extra columns:

df = df.drop('samples', axis=1).join(s, how='inner').drop('sample', axis=1)
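
As an aside, on pandas 0.25 or later, DataFrame.explode can collapse the stack-and-join steps above into one call; a minimal sketch, assuming the same samples column:

# explode expands each list element into its own row, repeating the
# other columns and preserving the original index (pandas >= 0.25)
df = df.explode('samples').drop('samples', axis=1)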

Which gives us this:

   id   half_hour_bucket    clock_in_time        clock_out_time       rate  mins
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
0  232  2019-04-01 20:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
1  342  2019-04-01 20:30:00 2019-04-01 19:12:00  2019-04-01 19:22:00  0.23  00:10:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
2  232  2019-04-01 19:00:00 2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00

Nearly there!

Reset the index:

df = df.reset_index(drop=True)

Set duplicate rows to NaN (duplicated() flags every repeat of a row after its first occurrence, and mask() replaces the flagged rows with NaN):

df = df.mask(df.duplicated())

Which gives:

   id    half_hour_bucket    clock_in_time       clock_out_time      rate mins
0  232.0 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1  NaN   NaT                 NaT                 NaT                 NaN  NaT
2  NaN   NaT                 NaT                 NaT                 NaN  NaT
3  NaN   NaT                 NaT                 NaT                 NaN  NaT
4  NaN   NaT                 NaT                 NaT                 NaN  NaT
5  NaN   NaT                 NaT                 NaT                 NaN  NaT
6  NaN   NaT                 NaT                 NaT                 NaN  NaT
7  NaN   NaT                 NaT                 NaT                 NaN  NaT
8  342.0 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
9  232.0 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
10 NaN   NaT                 NaT                 NaT                 NaN  NaT
11 NaN   NaT                 NaT                 NaT                 NaN  NaT
12 NaN   NaT                 NaT                 NaT                 NaN  NaT
13 NaN   NaT                 NaT                 NaT                 NaN  NaT
14 NaN   NaT                 NaT                 NaT                 NaN  NaT
15 NaN   NaT                 NaT                 NaT                 NaN  NaT
16 NaN   NaT                 NaT                 NaT                 NaN  NaT

Lastly, forward fill the half_hour_bucket and rate columns:

df[['half_hour_bucket', 'rate']] = df[['half_hour_bucket', 'rate']].ffill()

Final output:

     id     half_hour_bucket     clock_in_time        clock_out_time       rate  mins
0    232.0  2019-04-01 20:00:00  2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
1    NaN    2019-04-01 20:00:00  NaT                  NaT                  0.54  NaT
2    NaN    2019-04-01 20:00:00  NaT                  NaT                  0.54  NaT
3    NaN    2019-04-01 20:00:00  NaT                  NaT                  0.54  NaT
4    NaN    2019-04-01 20:00:00  NaT                  NaT                  0.54  NaT
5    NaN    2019-04-01 20:00:00  NaT                  NaT                  0.54  NaT
6    NaN    2019-04-01 20:00:00  NaT                  NaT                  0.54  NaT
7    NaN    2019-04-01 20:00:00  NaT                  NaT                  0.54  NaT
8    342.0  2019-04-01 20:30:00  2019-04-01 19:12:00  2019-04-01 19:22:00  0.23  00:10:00
9    232.0  2019-04-01 19:00:00  2019-04-01 19:12:00  2019-04-01 22:45:00  0.54  03:33:00
10   NaN    2019-04-01 19:00:00  NaT                  NaT                  0.54  NaT
11   NaN    2019-04-01 19:00:00  NaT                  NaT                  0.54  NaT
12   NaN    2019-04-01 19:00:00  NaT                  NaT                  0.54  NaT
13   NaN    2019-04-01 19:00:00  NaT                  NaT                  0.54  NaT
14   NaN    2019-04-01 19:00:00  NaT                  NaT                  0.54  NaT
15   NaN    2019-04-01 19:00:00  NaT                  NaT                  0.54  NaT
16   NaN    2019-04-01 19:00:00  NaT                  NaT                  0.54  NaT
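
In the output above every filled-in row repeats its block's first bucket. If you instead want each new row to advance to the next half hour bucket (8:30 PM, 9:00 PM, and so on, as described in the question), one possible sketch, run after the ffill step, is:

# Label each block of rows that came from one original row; after the
# mask step, id is only populated on the first row of each block
grp = df['id'].notna().cumsum()

# Position within the block: 0 for the original row, then 1, 2, ...
# for the filler rows that follow it
pos = df.groupby(grp).cumcount()

# Shift each row's bucket forward by 30 minutes per position
df['half_hour_bucket'] = df['half_hour_bucket'] + pd.to_timedelta(30 * pos, unit='m')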
Chris