So I have data in .csv form showing the time which specific users walks into and out of a building over a few months. I am trying to use R to tabulate the building occupancy every 15/30 minutes for analysis.
The data has been cleaned and is in the form of a large ffdf
data frame (~11 million entries). The data frame has numeric
and factor
types (it includes character strings and numbers which are potentially useful later), and the clock-in/clock-out times are in columns formatted as POSIXct
.
If this were a much smaller data set, i would do the following:
- Create an empty .csv file (ie
occupancy
) to store occupancy level, with time as the column headers and date as rows. - Use a for loop to iterate the following across all rows:
- Calculate total time in building (clock out - clock in time)
- Extract the date from the clock-in entry, store as
inDate
- Use
round_any
from theplyr
package to find the nearest 15-minute ceiling for the start time, store asnearest15
- Find the number of complete 15-minute blocks building was occupied by user, via
floor(as.numeric((clockouttime - clockintime)/15))
- Add 1 to the relevant number of time blocks user was in building, starting from
occupancy[inDate, nearest15]
.
However a for loop that iterates across 11 million rows would not be efficient at all.
Does anyone know how to do this efficiently? I am at a loss - the apply
family of functions will coerce all the data into a single type as far as i know. No need for specific commands if you are not familiar with them, I would just like someone to point me to the right package and general idea for implementation.
I am using ff
to access the data currently, but if there are better packages that can do this I am open to suggestions.
Thanks.
EDIT: here's a edited snippet of the code i'm looking at:
user_hash, section_hash, dept_id, col_a, col_b, clockin_datetime, clockout_datetime
EEDD1DA7F38CA42A35CF3C003B,85C7,TS,1,,2013-08-08 12:52:00,2013-08-08 23:00:00
2BCB6AA1603BB4357BC0D390C9,BFA3,VS,1,,2013-08-08 12:48:00,2013-08-08 22:58:00
46D859B55C4802DF51445025C5,943B,TS,1,,2013-08-08 11:58:00,2013-08-08 16:04:00
FE4EEA83AF6EA50CA5738B5610,00B3,VT,1,,2013-08-08 19:56:00,2013-08-08 23:04:00
8DB43D322F0AEF6D2B877862C3,DB1F,TS,1,,2013-08-08 12:49:00,2013-08-08 13:03:00
4E636571D425A74CA6B5FA7909,1860,VS,1,,2013-08-08 12:21:00,2013-08-08 14:01:00
26B41FA581408BDFD747234640,FDA4,VS,1,,2013-08-08 20:38:00,2013-08-08 23:03:00
A6C3C190BFFDCB4194774C1026,45C0,VT,1,,2013-08-08 12:58:00,2013-08-08 20:03:00
938506D977353EA65DC6BB5260,1819,VT,1,,2013-08-08 12:54:00,2013-08-08 16:01:00
E82F9350DA9FFC73EE6A66A286,04C1,VT,1,,2013-08-08 08:42:00,2013-08-08 12:45:00
6B92F1AB6B3EE193430B6B2793,6C2E,TS,1,,2013-08-08 09:58:00,2013-08-08 13:03:00
2B88836D8A4CA5183AAE5D3D9A,497C,TS,2,,2013-08-08 10:35:00,2013-08-08 16:06:00
The desired output i have in mind is something like this, although any form that shows me occupancy at any given time period/date is fine.
date 12.00 12.15 12.30 12.45 .......
2013-08-01 1344 1632 3742 1024
2013-08-02 342 435 435 435
2013-08-03
2013-08-04
...