
So I have data in .csv form showing the times at which specific users walk into and out of a building over a few months. I am trying to use R to tabulate the building occupancy every 15/30 minutes for analysis.

The data has been cleaned and is in the form of a large ffdf data frame (~11 million entries). The data frame has numeric and factor types (it includes character strings and numbers which are potentially useful later), and the clock-in/clock-out times are in columns formatted as POSIXct.

If this were a much smaller data set, I would do the following:

  • Create an empty .csv file (e.g. occupancy) to store the occupancy level, with times as the column headers and dates as rows.
  • Use a for loop to iterate the following across all rows:
    • Calculate total time in building (clock out - clock in time)
    • Extract the date from the clock-in entry, store as inDate
    • Use round_any from the plyr package to find the nearest 15-minute ceiling for the start time, store as nearest15
    • Find the number of complete 15-minute blocks building was occupied by user, via floor(as.numeric((clockouttime - clockintime)/15))
    • Add 1 to each of the relevant time blocks the user was in the building, starting from occupancy[inDate, nearest15].
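For what it's worth, the loop described above might be sketched roughly like this (a sketch only, using the sample data below as a data.frame `dd`; the `occupancy` matrix and `slots` vector are hypothetical names I've introduced for illustration):

```r
library(plyr)  # for round_any

# Hypothetical occupancy matrix: rows = dates, columns = 96 daily 15-min slots
slots <- format(seq(as.POSIXct("2013-08-08 00:00"), by = "15 min",
                    length.out = 96), "%H.%M")
dates <- as.character(unique(as.Date(dd$clockin_datetime)))
occupancy <- matrix(0, nrow = length(dates), ncol = length(slots),
                    dimnames = list(dates, slots))

# Naive per-row loop: workable for small data, far too slow for 11M rows
for (i in seq_len(nrow(dd))) {
  inTime  <- dd$clockin_datetime[i]
  outTime <- dd$clockout_datetime[i]
  inDate  <- as.character(as.Date(inTime))

  # Nearest 15-minute ceiling of the clock-in time
  nearest15 <- as.POSIXct(round_any(as.numeric(inTime), 15 * 60, f = ceiling),
                          origin = "1970-01-01")

  # Number of complete 15-minute blocks the user was present
  nBlocks <- floor(as.numeric(difftime(outTime, inTime, units = "mins")) / 15)

  # Add 1 to each occupied block, starting at the nearest15 column
  startCol <- match(format(nearest15, "%H.%M"), slots)
  cols <- startCol + seq_len(nBlocks) - 1   # empty if nBlocks == 0
  cols <- cols[cols <= length(slots)]       # clip at midnight
  occupancy[inDate, cols] <- occupancy[inDate, cols] + 1
}
```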

However, a for loop that iterates across 11 million rows would not be efficient at all.

Does anyone know how to do this efficiently? I am at a loss - the apply family of functions will coerce all the data into a single type, as far as I know. No need for specific commands if you are not familiar with them; I would just like someone to point me to the right package and the general idea for implementation.

I am using ff to access the data currently, but if there are better packages that can do this I am open to suggestions.

Thanks.

EDIT: here's an edited snippet of the data I'm looking at:

user_hash, section_hash, dept_id, col_a, col_b, clockin_datetime, clockout_datetime
EEDD1DA7F38CA42A35CF3C003B,85C7,TS,1,,2013-08-08 12:52:00,2013-08-08 23:00:00
2BCB6AA1603BB4357BC0D390C9,BFA3,VS,1,,2013-08-08 12:48:00,2013-08-08 22:58:00
46D859B55C4802DF51445025C5,943B,TS,1,,2013-08-08 11:58:00,2013-08-08 16:04:00
FE4EEA83AF6EA50CA5738B5610,00B3,VT,1,,2013-08-08 19:56:00,2013-08-08 23:04:00
8DB43D322F0AEF6D2B877862C3,DB1F,TS,1,,2013-08-08 12:49:00,2013-08-08 13:03:00
4E636571D425A74CA6B5FA7909,1860,VS,1,,2013-08-08 12:21:00,2013-08-08 14:01:00
26B41FA581408BDFD747234640,FDA4,VS,1,,2013-08-08 20:38:00,2013-08-08 23:03:00
A6C3C190BFFDCB4194774C1026,45C0,VT,1,,2013-08-08 12:58:00,2013-08-08 20:03:00
938506D977353EA65DC6BB5260,1819,VT,1,,2013-08-08 12:54:00,2013-08-08 16:01:00
E82F9350DA9FFC73EE6A66A286,04C1,VT,1,,2013-08-08 08:42:00,2013-08-08 12:45:00
6B92F1AB6B3EE193430B6B2793,6C2E,TS,1,,2013-08-08 09:58:00,2013-08-08 13:03:00
2B88836D8A4CA5183AAE5D3D9A,497C,TS,2,,2013-08-08 10:35:00,2013-08-08 16:06:00

The desired output I have in mind is something like this, although any form that shows me the occupancy at any given time period/date is fine.

date        12.00  12.15  12.30  12.45  .......
2013-08-01   1344   1632   3742   1024
2013-08-02    342    435    435    435
2013-08-03
2013-08-04
...

ethane
  • Consider using `cut` to create a factor that divides the clock-ins and clock-outs into 15-minute chunks. Then use `table` to tabulate the numbers of clock-ins and clock-outs in each interval. The difference between the `cumsum(clockins)` and `cumsum(clockouts)` is your occupancy. See the second code chunk in [this article](http://stackoverflow.com/questions/28662110/categorizing-date-in-r/28662345#28662345) for how to use `lubridate` to create a vector with a regular series of times that you can use to `cut()` your data into regular intervals. – Jthorpe Mar 13 '15 at 23:41
  • It would be more helpful if you could supply a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data and desired output. – MrFlick Mar 14 '15 at 02:57
  • @MrFlick, I just added some sample data. – ethane Mar 14 '15 at 03:25

1 Answer


I've worked with similar data in the past and found that rearranging the data can help. First, I'll assume your dates are properly encoded as date values and that your sample data is in a data.frame called dd. For example:

dd <- structure(list(user_hash = structure(c(11L, 3L, 4L, 12L, 7L, 
5L, 1L, 9L, 8L, 10L, 6L, 2L), .Label = c("26B41FA581408BDFD747234640", 
"2B88836D8A4CA5183AAE5D3D9A", "2BCB6AA1603BB4357BC0D390C9", "46D859B55C4802DF51445025C5", 
"4E636571D425A74CA6B5FA7909", "6B92F1AB6B3EE193430B6B2793", "8DB43D322F0AEF6D2B877862C3", 
"938506D977353EA65DC6BB5260", "A6C3C190BFFDCB4194774C1026", "E82F9350DA9FFC73EE6A66A286", 
"EEDD1DA7F38CA42A35CF3C003B", "FE4EEA83AF6EA50CA5738B5610"), class = "factor"), 
    section_hash = structure(c(8L, 10L, 9L, 1L, 11L, 4L, 12L, 
    5L, 3L, 2L, 7L, 6L), .Label = c("00B3", "04C1", "1819", "1860", 
    "45C0", "497C", "6C2E", "85C7", "943B", "BFA3", "DB1F", "FDA4"
    ), class = "factor"), dept_id = structure(c(1L, 2L, 1L, 3L, 
    1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L), .Label = c("TS", "VS", "VT"
    ), class = "factor"), col_a = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 2L), col_b = c(NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA), clockin_datetime = structure(c(1375980720, 
    1375980480, 1375977480, 1376006160, 1375980540, 1375978860, 
    1376008680, 1375981080, 1375980840, 1375965720, 1375970280, 
    1375972500), class = c("POSIXct", "POSIXt"), tzone = ""), 
    clockout_datetime = structure(c(1376017200, 1376017080, 1375992240, 
    1376017440, 1375981380, 1375984860, 1376017380, 1376006580, 
    1375992060, 1375980300, 1375981380, 1375992360), class = c("POSIXct", 
    "POSIXt"), tzone = "")), .Names = c("user_hash", "section_hash", 
"dept_id", "col_a", "col_b", "clockin_datetime", "clockout_datetime"
), row.names = c(NA, -12L), class = "data.frame")

Now, if you arrange the data into a single stream of in/out times and assign a value of +1 for people entering the building and -1 for people leaving the building, you would have something like

dx <- rbind(
   data.frame(val=1, time=dd$clockin_datetime), 
   data.frame(val=-1, time=dd$clockout_datetime)
)
dx <- dx[order(dx$time), ]

Then, to find the number of people in the building at any given time, you just need to do a cumulative sum on the val column:

transform(dx, pop=cumsum(val))

Then you can split that out into intervals.
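Continuing from the `dx` above, one way to do that split (a sketch using base R's `cut.POSIXt`; the `bin` and `occ` names are mine) is to bin each event into a 15-minute interval and take the last running value in each bin as the occupancy at the end of that interval:

```r
# Running occupancy at each in/out event (dx is already sorted by time)
dx$pop <- cumsum(dx$val)

# Assign each event to a 15-minute interval
dx$bin <- cut(dx$time, breaks = "15 min")

# Occupancy at the end of each interval = last running value in the bin
occ <- aggregate(pop ~ bin, data = dx, FUN = function(x) tail(x, 1))
head(occ)
```

From there, reshaping `occ` into a date-by-time-slot table is a standard long-to-wide step.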

Working with data.tables rather than data.frames would probably perform better for data of your scale, but tweaking things to find out what would work best for your data would require a larger test case. Still, I think this general strategy could be quite useful.
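As a rough illustration of what that might look like (a sketch only, using the sample `dd` above; `data.table` is a separate package you would need to install):

```r
library(data.table)

# Same in/out event stream, built as a data.table
DT <- data.table(
  time = c(dd$clockin_datetime, dd$clockout_datetime),
  val  = rep(c(1L, -1L), each = nrow(dd))
)
setkey(DT, time)                          # sorts by time
DT[, pop := cumsum(val)]                  # running occupancy
DT[, bin := cut(time, breaks = "15 min")] # 15-minute intervals

# Occupancy at the end of each interval
DT[, .(occupancy = last(pop)), keyby = bin]
```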

MrFlick
  • MrFlick, thanks! I will look into this method. I looked at the data.table package but I'm not sure if it allows for file-based access to data files. My data file is currently about 1GB, which can definitely be loaded into memory, but I wanted to take this time to also practice working with large data beyond RAM capacity, and as far as I know data.table loads all the data into memory. I could be wrong, though - I'll investigate further. – ethane Mar 14 '15 at 19:03