
Context: WiFi DHCP leases recorded in syslog.

A DHCP lease lasts 24 hours, and each record represents a successful WiFi "session". Some devices, such as phones, initiate multiple sessions per day, so we need to count only unique MAC addresses per 24-hour period, but we also want to keep the timestamp of each device's first connection. Finally, we need to be able to sum by hour and by day.

TL;DR: I need to show daily unique MAC addresses broken down by hour. Not unique per hour, but unique per day, then broken down by the hour of first connection and summed.

Sample data frame:

    branch      timestamp            mac
0   branch_a    2020-09-01 00:00:00  48:c7:96:1d:91:af
1   branch_a    2020-09-01 00:08:00  48:c7:96:1d:91:bx
2   branch_b    2020-09-01 00:36:07  48:c7:96:1d:80:ff
3   branch_b    2020-09-01 00:41:24  48:c7:96:1d:86:ff
4   branch_c    2020-09-01 00:44:33  48:c7:96:1d:76:bv
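
For reference, a minimal reconstruction of this sample frame in pandas (values copied verbatim from the rows above; the timestamps are left as plain strings here):

import pandas as pd

# Reconstruction of the sample frame shown above.
wifi = pd.DataFrame(
    {
        'branch': ['branch_a', 'branch_a', 'branch_b', 'branch_b', 'branch_c'],
        'timestamp': ['2020-09-01 00:00:00', '2020-09-01 00:08:00',
                      '2020-09-01 00:36:07', '2020-09-01 00:41:24',
                      '2020-09-01 00:44:33'],
        'mac': ['48:c7:96:1d:91:af', '48:c7:96:1d:91:bx', '48:c7:96:1d:80:ff',
                '48:c7:96:1d:86:ff', '48:c7:96:1d:76:bv'],
    }
)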

Steps:

  1. Group by branch
  2. Take the first occurrence of each unique MAC address per day
  3. Sum MAC addresses by hour

This attempt still shows the same MAC more than once, because it groups on the full timestamp, so every session keeps its own row:

branch_daily = wifi.groupby(['branch', 'month', 'timestamp'])['mac'].first()

Intended result:

    branch      timestamp            mac
0   branch_a    2020-09-01 00:00:00  5
1   branch_a    2020-09-01 01:00:00  10
2   branch_a    2020-09-01 02:00:00  3
3   branch_a    2020-09-01 03:00:00  4
4   branch_a    2020-09-01 04:00:00  11

Where mac is the count of that day's unique MAC addresses for each hour.

This is what ended up working, based on the suggestions in the comments:

# Parse the raw syslog timestamps (the format string matches e.g. "Sep 01 2020 00:00:00")
# and break out the parts we group on.
wifi['timestamp'] = pd.to_datetime(wifi['timestamp'], format='%b %d %Y %H:%M:%S')
wifi['month']     = wifi['timestamp'].dt.month
wifi['day']       = wifi['timestamp'].dt.day
wifi['hour']      = wifi['timestamp'].dt.hour

# Keep each mac's first session per branch/month/day; 'branch' and 'month' are
# included in the subset so the same mac seen at another branch (or in another
# month) is still counted there.
uniq_per_day = wifi.drop_duplicates(subset=['branch', 'month', 'day', 'mac'], keep='first')

# Hourly: that day's unique macs, bucketed by the hour of first connection
uniq_per_day.groupby(['branch', 'month', 'day', 'hour']).agg({'mac': 'count'})

# Daily: unique macs per branch per day
uniq_per_day.groupby(['branch', 'month', 'day']).agg({'mac': 'count'})

#...etc.
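
As a sanity check (a minimal sketch, assuming the uniq_per_day frame defined above): each mac contributes exactly one row per branch/month/day after drop_duplicates, so the hourly counts should sum back to the daily counts.

# Hourly counts should sum back to the daily unique-mac counts.
hourly = uniq_per_day.groupby(['branch', 'month', 'day', 'hour'])['mac'].count()
daily  = uniq_per_day.groupby(['branch', 'month', 'day'])['mac'].count()

assert hourly.groupby(level=['branch', 'month', 'day']).sum().equals(daily)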
  • `Not unique by the hour, but by day ...then broken down by hour and summed.` These seem like contradictory requirements. Are they unique for each hour or each day? – Nick ODell Oct 09 '20 at 03:30
  • @NickODell I will add some background in the question text. Thank you for asking. Basically, DHCP leases are for 24 hours and some devices like phones will create sessions multiple times per day, so we need to look at unique macs per day. However, we also want to preserve the time of the first session so we can gauge busy times of day. – Bubnoff Oct 09 '20 at 16:57
  • You could do that in three steps: first, extract the date from the timestamp; second, do a drop_duplicates with keep=first and subset=day/mac; third, do a groupby and count. – Nick ODell Oct 09 '20 at 17:12
  • @NickODell That sounds like just the ticket. I'm going to give it a shot. I think the `drop_duplicates` is the key I was looking for. Need to look at those docs. Thanks! – Bubnoff Oct 09 '20 at 17:21
  • @NickODell `wifi.drop_duplicates(subset=['day','mac'], keep='first')` -- Just to clarify: this will preserve unique macs per day, keeping the first occurrence. `wifi.groupby(['branch','month','day','hour']).agg({'mac':'nunique'})` -- so this should give me the hourly stats, correct? Or should it be `wifi.groupby(['branch','month','day','hour']).agg({'mac':'count'})`? `count` yields a higher number, and it's already been "uniq-ed" in a previous step, so I'm thinking it should be "count". – Bubnoff Oct 09 '20 at 17:39
  • @NickODell you solved it! If you create an answer, I'll check it. Thank you!! `drop_duplicates` was key. – Bubnoff Oct 09 '20 at 17:54

1 Answer


You can do this in three steps (a sketch follows the list):

  1. Extract the date from the timestamp.
  2. Do a drop_duplicates with keep='first' and a subset of the date and mac columns.
  3. Do a groupby and count.
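
Putting the steps together, a minimal sketch (assuming a wifi frame with the branch, timestamp, and mac columns from the question; first_seen, hourly_uniques, and daily_uniques are just illustrative names, and branch is included in the subset so the same mac at two branches is counted at both):

import pandas as pd

# 1. Extract the date (and the hour of first connection) from the timestamp.
wifi['timestamp'] = pd.to_datetime(wifi['timestamp'])
wifi['date'] = wifi['timestamp'].dt.date
wifi['hour'] = wifi['timestamp'].dt.hour

# 2. Keep only each device's first session per branch per day.
first_seen = (wifi.sort_values('timestamp')
                  .drop_duplicates(subset=['branch', 'date', 'mac'], keep='first'))

# 3. Group by and count: daily unique macs, broken down by hour of first connection.
hourly_uniques = (first_seen
                  .groupby(['branch', 'date', 'hour'])['mac']
                  .count()
                  .rename('unique_macs')
                  .reset_index())

daily_uniques = (first_seen
                 .groupby(['branch', 'date'])['mac']
                 .count()
                 .rename('unique_macs')
                 .reset_index())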