
I have a matrix, events, that contains the times of occurrence of 5 million events. Each of these 5 million events has a "type" that ranges from 1 to 2000. A very simplified version of the matrix is shown below. The unit for "times" is seconds since 1970, and all of the events have occurred since 1/1/2012.

>events
      type          times
      1           1352861760
      1           1362377700
      2           1365491820
      2           1368216180
      2           1362088800
      2           1362377700

I am trying to divide the time since 1/1/2012 into 5-minute buckets and then count, for each type i, how many events of that type occurred in each bucket. My code is below. Note that `types` is a vector containing each possible type from 1-2000, and `by` is set to 300 because that is how many seconds there are in 5 minutes.

# For each type, bin its event times into 5-minute (300 s) buckets
for(i in 1:length(types)){
    local <- events[events$type==types[i], c("type", "times")]
    assign(sprintf("a%d", i),
           table(cut(local$times,
                     breaks=seq(range(events$times)[1],
                                range(events$times)[2], by=300))))
}

This results in variables a1 through a2000, each of which contains a row vector of how many occurrences of type i there were in each of the 5-minute buckets.

I then proceed to find all pairwise correlations among a1 through a2000.
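
For concreteness, that correlation step looks something like the sketch below (illustrative only; it assumes every aN has the same length, which holds here because they all share the same breaks, and `acounts` is just a throwaway name):

# Gather a1..a2000 into one matrix (one column per type), then
# compute all pairwise correlations in a single cor() call
acounts <- sapply(mget(paste0("a", 1:length(types))), as.vector)
cc <- cor(acounts)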

Is there a way to optimize the chunk of code above? It runs very slowly, yet I can't think of a way to make it faster. Perhaps there are just too many buckets and too little time.

Any insight would be much appreciated.

Reproducible example:

>head(events)
     type         times
      12           1308575460
      12           1308676680
      12           1308825420
      12           1309152660
      12           1309879140
      25           1309946460

library(xts)  # provides xts(), endpoints() and period.apply()
xevents <- xts(events[,"type"],.POSIXct(events[,"times"]))
ep <- endpoints(xevents, "minutes", 5)
counts <- period.apply(xevents, ep, tabulate, nbins=length(types))

>head(counts)
                       1    2    3    4    5   6    7    8    9   10   11  12   13   14
2011-06-20 09:11:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-06-21 13:18:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-06-23 06:37:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-06-27 01:31:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-07-05 11:19:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-07-06 06:01:00    0    0    0    0    0   0    0    0    0    0    0   0    0   0

>ep[1:20]
[1]  0  1  2  3  4  5  6  7  8  9 10 12 20 21 22 23 24 25 26 27

Above is the code I have been using, but the problem is that the intervals do not advance in regular 5-minute steps: endpoints only marks the last observation within each interval that actually contains events, so intervals with no events are skipped entirely.
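
One way to see the problem (an illustrative check, not part of my original code): if the bucketing were regular, consecutive row times of counts would differ by exactly 5 minutes, but instead they track the spacing of the events themselves:

# gaps between consecutive interval timestamps; these should all be
# 5 minutes, but here they follow the gaps between actual events
diff(index(counts))[1:5]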

user2588829
  • Your "reproducible example" isn't [reproducible](http://stackoverflow.com/q/5963269/271616) and you don't show what you want the output to be, but I assume you want a row at every 5-minute observation, regardless of whether or not you actually have data in that interval. – Joshua Ulrich Jul 26 '13 at 10:47

3 Answers


I would use the xts package for this. Running a function over non-overlapping 5-minute intervals is easy with the period.apply and endpoints functions.

# create sample data
library(xts)
set.seed(21)
N <- 1e6
events <- cbind(sample(2000, N, replace=TRUE),
  as.POSIXct("2012-01-01")+sample(1e7,N))
colnames(events) <- c("type","times")
# create xts object
xevents <- xts(events[,"type"], .POSIXct(events[,"times"]))
# find the last row of each non-overlapping 5-minute interval
ep <- endpoints(xevents, "minutes", 5)
# count the number of occurrences of each "type"
counts <- period.apply(xevents, ep, tabulate, nbins=2000)
# set colnames
colnames(counts) <- paste0("a",1:ncol(counts))
# calculate correlation
#cc <- cor(counts)
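
A couple of quick sanity checks on the aggregated object (illustrative, not part of the answer):

# rows = intervals that contain events; cols = one per type
dim(counts)
# every event lands in exactly one interval, so the totals must match
sum(counts) == length(xevents)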

Update to respond to OP's comments/edits:

# Create a sequence of 5-minute steps, either from the actual start
# of the data...
m5 <- seq(round(start(xevents),'mins'), end(xevents), by='5 mins')
# ...or from the start of 2012-01-01 (choose one; as written, the
# second assignment overwrites the first)
m5 <- seq(as.POSIXct("2012-01-01"), end(xevents), by='5 mins')
# merge xevents with an empty 5-minute xts object, and subtract
# 1 second, so endpoints fall at the end of each 5-minute interval
xevents5 <- merge(xevents, xts(,m5-1))
ep5 <- endpoints(xevents5, "minutes", 5)
counts5 <- period.apply(xevents5, ep5, tabulate, nbins=2000)
colnames(counts5) <- paste0("a",1:ncol(counts5))
# align to the beginning of each 5-minute interval, if you want
counts5 <- align.time(counts5,60*5)
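
To verify the fix (an illustrative check of mine): once every 5-minute interval is represented, consecutive row times of counts5 should be exactly 300 seconds apart:

# all gaps between consecutive interval timestamps should equal 300 s
all(diff(as.numeric(index(counts5))) == 300)
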
Joshua Ulrich
  • this code is great! never knew about the xts library until now. the .POSIXct step, however, is converting my dates wrong and is leading to miscalculations ... any idea how to fix this? – user2588829 Jul 25 '13 at 16:35
  • @user2588829: I'd have an idea how to fix it if you were less vague... "converting my dates wrong" tells me nothing. – Joshua Ulrich Jul 25 '13 at 16:39
  • well, converting it using the .POSIXct function (the exact function I'm using is: `as.POSIXct(strptime(x, format="%m/%d/%y %H:%M:%S", tz="GMT"), origin="1970-01-01")`) is making something that was originally 11/14/2012 02:56 into 1970-01-07 14:28:44. – user2588829 Jul 25 '13 at 16:42
  • @user2588829: the `as.POSIXct` call in your example, with `x <- "11/14/2012 02:56"` results in `NA`. And you don't use `as.POSIXct` in your question, so I still don't understand what's wrong. Please provide something reproducible. – Joshua Ulrich Jul 25 '13 at 16:46
  • @user2588829: You initially said the `"times"` column was seconds since 1970-01-01. Now you seem to be saying it's a character vector. Please update your question to represent the actual problem you're facing. – Joshua Ulrich Jul 25 '13 at 18:10

With 5 million records I would probably use data.table. You can achieve this as follows:

#  First we make a sequence of bucket boundaries, from the initial time
#  to at least the end time + 5 minutes (df is the events data.frame)
buckets <- seq( from = min( df$times ) , by = 300 , to = max( df$times ) + 300 )

require( data.table )
DT <- data.table( df )

#  Work out which bucket each time is in and store it in a new column
DT[ , Bucket := which.max( times <= buckets ) , by = "times" ]

#  Aggregate events by type and time bucket
DT[ , list( Count = .N ) , by = list( type , Bucket ) ]
   type Bucket Count
1:    1      1     1
2:    1  31721     1
3:    2  42102     1
4:    2  51183     1
5:    2  30758     1
6:    2  31721     1
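
To go from this long-format aggregate to the wide bucket-by-type matrix that the question ultimately correlates, one option is reshape2's dcast (a sketch, not part of the answer above; `agg`, `wide` and `cc` are throwaway names):

# Reshape to one row per bucket and one column per type, filling
# empty (type, bucket) combinations with zero counts
library( reshape2 )
agg <- DT[ , list( Count = .N ) , by = list( type , Bucket ) ]
wide <- dcast( as.data.frame( agg ) , Bucket ~ type ,
               value.var = "Count" , fill = 0 )
# drop the Bucket column before computing pairwise correlations
cc <- cor( wide[ , -1 ] )
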
Simon O'Hanlon

`cut` the times within their range, just the way you did. After that, you can tabulate using `table` or `xtabs`, but over the entire dataset at once, to generate a matrix. Something like the following:

# round the time range down to whole 5-minute (300 s) boundaries
r <- trunc(range(events$times) / 300) * 300
# assign each event to a 5-minute bin
events$times.bin <- cut(events$times, seq(r[1], r[2] + 300, by=300))
# cross-tabulate type against time bin in a single pass
xtabs(~ type + times.bin, events, drop.unused.levels=TRUE)

Decide whether you want `drop.unused.levels` or not. With this kind of data, you might also want to create a sparse matrix.
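
The sparse variant could look like the following (a sketch; it assumes your version of R has the `sparse` argument to `xtabs`, which requires the Matrix package):

# sparse cross-tabulation; returns a sparse Matrix object
library(Matrix)
tab <- xtabs(~ type + times.bin, events, sparse=TRUE)
# correlate types: transpose so types are columns; note as.matrix()
# densifies, so this is only practical if the result fits in memory
cc <- cor(t(as.matrix(tab)))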

krlmlr