
I have data records for different entities, and for each entity some count recorded in a specific hour during the day for a whole month. For example:

     entity_id    time              counts
0      175  2019-03-01 05:00:00       3
1      175  2019-03-01 06:00:00       4
2      175  2019-03-01 07:00:00       6
3      175  2019-03-01 08:00:00       6
4      175  2019-03-01 09:00:00       7
5      178  2019-03-01 05:00:00       8
6      178  2019-03-01 06:00:00       4
7      178  2019-03-01 07:00:00       5
8      178  2019-03-01 08:00:00       6
9      200  2019-03-01 05:00:00       7
10     200  2019-03-01 08:00:00       3
11     175  2019-03-03 05:00:00       3
12     175  2019-03-03 07:00:00       6
13     175  2019-03-03 08:00:00       6
14     175  2019-03-03 09:00:00       7
15     178  2019-03-03 05:00:00       8
16     178  2019-03-03 06:00:00       4
17     178  2019-03-03 07:00:00       5
18     178  2019-03-03 08:00:00       6
19     200  2019-03-03 05:00:00       7
20     200  2019-03-03 08:00:00       3
21     200  2019-03-03 09:00:00       7
...

I want to be able to aggregate for each entity the mean of the counts in several ranges of hours in different days of the week throughout the month. E.g.:

  • The mean for Morning (6-10AM) on Sundays
  • The mean for Morning (6-10AM) on Sundays-Thursdays
  • The mean for Noon (11AM-1PM) on Sundays-Thursdays
  • The mean for Noon (11AM-1PM) on Fri-Sat
  • The mean for Evening (6PM-9PM) on Fri
  • etc.

So I wish to get a df like this (partial example):

     entity_id day_in_week time_in_day counts_mean
0      175     sun         eve         5
1      175     sun-thu     noon        6
2      178     sun         eve         5
3      178     sat         eve         5
4      200     sun-thu     morning     2
...

I managed to get this partially done by iterating over the data, slicing and extracting different elements, but I assume there's a much more efficient way.

I started with this issue, but I still had too many for loops. Any ideas how to optimize the performance?

Shakedk

2 Answers


If your time column is a datetime object in pandas, you can create new columns from it using the .dt datetime accessor.

You can follow these steps:

  1. Create a column indicating day_in_week:

     df["day_in_week"] = df["time"].dt.dayofweek

  2. Then use a simple .apply function to add a column that divides the time into slots (morning, evening, etc.) by comparing the hour inside the function.

  3. Then create another column indicating your combinations, based on the two columns created before.

  4. Then use groupby on those columns to get the grouped data or metrics for each group.

I know this process is a bit long, but it has no for loops; it uses df.apply, the datetime properties that pandas already provides, and some if-else conditions as per your requirements.

Steps 2-4 are completely data-dependent; since I do not have the data, I cannot write the exact code. I tried my best to explain the methods that can be used.
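A minimal sketch of those steps, using the column names from the question (the slot boundaries and the time_slot helper are illustrative assumptions, not part of the answer):

```python
import pandas as pd

# Tiny sample in the question's shape; 2019-03-01 was a Friday
df = pd.DataFrame({
    'entity_id': [175, 175, 178],
    'time': pd.to_datetime(['2019-03-01 06:00:00',
                            '2019-03-01 12:00:00',
                            '2019-03-01 19:00:00']),
    'counts': [4, 6, 5],
})

# Step 1: numeric day of week (Monday=0 ... Sunday=6)
df['day_in_week'] = df['time'].dt.dayofweek

# Step 2: assign each timestamp to a named slot (boundaries are assumptions)
def time_slot(ts):
    h = ts.hour
    if 6 <= h <= 10:
        return 'morning'
    elif 11 <= h <= 13:
        return 'noon'
    elif 18 <= h <= 21:
        return 'eve'
    return 'other'

df['time_in_day'] = df['time'].apply(time_slot)

# Steps 3-4: group on the combination and aggregate
result = (df.groupby(['entity_id', 'day_in_week', 'time_in_day'])['counts']
            .mean()
            .reset_index(name='counts_mean'))
print(result)
```

Grouping day ranges such as 'sun-thu' would need an extra mapping from the numeric day to a range label before the groupby.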

I hope this helps.

yanarp
  • I'll give it a go :) – Shakedk May 17 '20 at 18:59
  • If I follow step 3, then I would get a column as such: "3_morning" "0_morning" etc. Let's say they I want to group these 2 rows 3 times: once for day 0, once for day 3 and once for both days together - how can I express that in a groupby? – Shakedk May 17 '20 at 19:50
  • This can help! https://stackoverflow.com/questions/17679089/pandas-dataframe-groupby-two-columns-and-get-counts – yanarp May 18 '20 at 08:40

My solution is based on an auxiliary DataFrame defining the ranges for which means are to be computed: day_in_week, time_in_day, and the respective CustomBusinessHour object for each combination.

Creation of this DataFrame (I called it calendars) starts from day_in_week, time_in_day columns:

calendars = pd.DataFrame([
    ['sun',     'morning'],
    ['sun-thu', 'morning'],
    ['sun-thu', 'noon'],
    ['fri-sat', 'noon'],
    ['fri',     'eve']],
    columns=['day_in_week', 'time_in_day'])

If you want more such definitions, add them here.

Then, to add corresponding CustomBusinessHour objects:

  1. Define a function to get hour limits:

    def getHourLimits(name):
        if name == 'morning':
            return '06:00', '10:00'
        elif name == 'noon':
            return '11:00', '13:00'
        elif name == 'eve':
            return '18:00', '21:00'
        else:
            return '08:00', '16:00'
    
  2. Define a function to get week mask (start hour and end hour):

    def getWeekMask(name):
        parts = name.split('-')
        if len(parts) > 1:
            fullWeek = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
            ind1 = fullWeek.index(parts[0].capitalize())
            ind2 = fullWeek.index(parts[1].capitalize())
            return ' '.join(fullWeek[ind1 : ind2 + 1])
        else:
            return parts[0].capitalize()
    
  3. Define a function generating a CustomBusinessHour object:

    def getCBH(row):
        wkMask = getWeekMask(row.day_in_week)
        hStart, hEnd = getHourLimits(row.time_in_day)
        return pd.offsets.CustomBusinessHour(weekmask=wkMask, start=hStart, end=hEnd)
    
  4. Add CustomBusinessHour objects to calendars:

    calendars['CBH'] = calendars.apply(getCBH, axis=1)
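
As a quick sanity check of these definitions (assuming the same weekmask and hours as above), is_on_offset tells you whether a timestamp falls inside a given calendar:

```python
import pandas as pd

# 'sun-thu' / 'morning' calendar, as built by getCBH above
cbh = pd.offsets.CustomBusinessHour(weekmask='Sun Mon Tue Wed Thu',
                                    start='06:00', end='10:00')

# 2019-03-03 was a Sunday, and 07:00 is inside the 06:00-10:00 window
print(cbh.is_on_offset(pd.Timestamp('2019-03-03 07:00:00')))  # True
# 2019-03-01 was a Friday, outside the Sun-Thu weekmask
print(cbh.is_on_offset(pd.Timestamp('2019-03-01 07:00:00')))  # False
```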
    

Then define a function computing all required means, for the given entity Id:

def getSums(entId):
    outRows = []
    wrk = df[df.entity_id.eq(entId)]    # Filter for entity Id
    for _, row in calendars.iterrows():
        dd = row.day_in_week
        hh = row.time_in_day
        cbh = row.CBH
        # Filter for the current calendar
        cnts = wrk[wrk.time.apply(lambda val: cbh.is_on_offset(val))]
        cnt = cnts.counts.mean()
        if pd.notnull(cnt):
            outRows.append(pd.Series([entId, dd, hh, cnt],
                index=['entity_id', 'day_in_week', 'time_in_day', 'counts_mean']))
    return pd.DataFrame(outRows)

As you can see, the result contains only non-null means.

And to generate the result, run:

pd.concat([getSums(entId) for entId in df.entity_id.unique()], ignore_index=True)

For your data sample (containing only readings from morning hours), the result is:

   entity_id day_in_week time_in_day  counts_mean
0        175         sun     morning     6.333333
1        175     sun-thu     morning     6.333333
2        178         sun     morning     5.000000
3        178     sun-thu     morning     5.000000
4        200         sun     morning     5.000000
5        200     sun-thu     morning     5.000000
Valdi_Bo