
I have log data that spans over 30 days. I'm looking to group the data to see which 15-minute window has the lowest number of events in total over 24 hours. The data is formatted like so:

2021-04-26 19:12:03, upload
2021-04-26 11:32:03, download
2021-04-24 19:14:03, download
2021-04-22 1:9:03, download
2021-04-19 4:12:03, upload
2021-04-07 7:12:03, download

and I'm looking for a result like

19:15:00, 2
11:55:00, 1
7:15:00, 1
4:15:00, 1
1:15:00, 1

Currently, I'm using a Grouper:

df['date'] = pd.to_datetime(df['date'])
df.groupby(pd.Grouper(key="date",freq='.25H')).Host.count()

and my results look like this:

date
2021-04-08 16:15:00+00:00     1
2021-04-08 16:30:00+00:00    20
2021-04-08 16:45:00+00:00     6
2021-04-08 17:00:00+00:00     6
2021-04-08 17:15:00+00:00     0
                             ..
2021-04-29 18:00:00+00:00     3
2021-04-29 18:15:00+00:00     9
2021-04-29 18:30:00+00:00     0
2021-04-29 18:45:00+00:00     3
2021-04-29 19:00:00+00:00    15

Is there any way I can group again on just the time, without including the date?
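
Ideally I imagine something along these lines, grouping on just the time of day (just a sketch of the idea, not something I have working):

df['date'] = pd.to_datetime(df['date'])
df.groupby(df['date'].dt.floor('15min').dt.time).size()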


3 Answers


Let's say you want to gather counts in 5-minute windows (the same idea works for 15 minutes). For this, you need the timestamp column. Let df be your pandas DataFrame. For each time in the timestamp column, round it up to the nearest multiple of 5 minutes and add it to a counter map. See the code below.

import collections
import math

timestamp = df["timestamp"]
counter = collections.defaultdict(int)

def get_time(time):
    time = time.split()[-1]  # keep only the HH:MM:SS part if the column holds full datetime strings
    hh, mm, ss = map(int, time.split(':'))
    total_seconds = hh * 3600 + mm * 60 + ss
    roundup_seconds = math.ceil(total_seconds / (5*60)) * (5*60)
    # I suggest trying out the above formula on paper for a better understanding
    # '5 min' means '5*60 sec' roundup
    new_hh = roundup_seconds // 3600
    roundup_seconds %= 3600
    new_mm = roundup_seconds // 60
    roundup_seconds %= 60
    new_ss = roundup_seconds
    return f"{new_hh}:{new_mm:02d}:{new_ss:02d}"  # f-strings for python 3.6 and above

for time in timestamp:
    counter[get_time(time)] += 1

# Now counter will carry counts of rounded time stamp
# I've tested locally and it's same as the output you mentioned. 
# Let me know if you need any further help :)
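
For instance, to then print the windows sorted from fewest to most events (since the question is after the lowest-traffic window), something along these lines should work on the counter built above:

for t, n in sorted(counter.items(), key=lambda kv: kv[1]):
    print(f"{t}, {n}")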

One approach is to use Timedelta instead of Datetime, since the comparison happens only between hours and minutes, not dates.

import pandas as pd
import numpy as np

df = pd.DataFrame({'time': {0: '2021-04-26 19:12:03', 1: '2021-04-26 11:32:03',
                            2: '2021-04-24 19:14:03', 3: '2021-04-22 1:9:03',
                            4: '2021-04-19 4:12:03', 5: '2021-04-07 7:12:03'},
                   'event': {0: 'upload', 1: 'download', 2: 'download',
                             3: 'download', 4: 'upload', 5: 'download'}})

# Convert To TimeDelta (Ignore Day)
df['time'] = pd.to_timedelta(df['time'].str[-8:])

# Set TimeDelta as index
df = df.set_index('time')
# Get Count of events per 15 minute period
df = df.resample('.25H')['event'].count()

# Convert To Nearest 15 Minute Interval
ns15min = 15 * 60 * 1000000000  # 15 minutes in nanoseconds
df.index = pd.to_timedelta(((df.index.astype(np.int64) // ns15min + 1) * ns15min))

# Reset Index, Filter and Sort
df = df.reset_index()
df = df[df['event'] > 0]
df = df.sort_values(['event', 'time'], ascending=(False, False))
# Remove Day Part of Time Delta (Convert to str)
df['time'] = df['time'].astype(str).str[-8:]

# For Display
print(df.to_string(index=False))

Filtered Output:

    time  event
19:15:00      2
21:00:00      1
11:30:00      1
07:15:00      1
04:15:00      1
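
If you want the result printed in exactly the "time, count" shape shown in the question, a small follow-up sketch using the df produced above:

for _, row in df.iterrows():
    print(f"{row['time']}, {row['event']}")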

Do you want something like this?

The idea here is: if you're not concerned about the date, you can replace all the dates with a single fixed date, and then group/count the data based on the time only.

df['Host'] = 1  # helper column of 1s so there is something to count
# Replace every date with the same fixed date so only the time of day differs
df.date = df.date.str.replace(r'(\d{4}-\d{1,2}-\d{1,2})', '2021-04-26', regex=True)
df.date = pd.to_datetime(df.date)
# Count events per 15-minute bucket, then drop empty buckets and keep only the time part
new_df = df.groupby(pd.Grouper(key='date', freq='.25H')).agg({'Host': 'sum'}).reset_index()
new_df = new_df.loc[new_df['Host'] != 0]
new_df['date'] = new_df['date'].dt.time
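
And to display it in the same shape as the example output in the question (sorted by count, highest first), a possible follow-up using new_df from above:

new_df = new_df.sort_values('Host', ascending=False)
for _, row in new_df.iterrows():
    print(f"{row['date']}, {row['Host']}")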
