I have a matrix, events, that contains the times of occurrence of 5 million events. Each of these events has a "type" that ranges from 1 to 2000. A very simplified version of the matrix is shown below. The unit for "times" is seconds since 1970 (Unix time). All of the events have occurred since 1/1/2012.
>events
type times
1 1352861760
1 1362377700
2 1365491820
2 1368216180
2 1362088800
2 1362377700
I am trying to divide the time since 1/1/2012 into 5-minute buckets and then count, for each bucket, how many events of each type i occurred in it. My code is below. Note that types is a vector containing each possible type from 1 to 2000, and by is set to 300 because that is how many seconds are in 5 minutes.
for (i in 1:length(types)) {
  local <- events[events$type == types[i], c("type", "times")]
  assign(sprintf("a%d", i),
         table(cut(local$times,
                   breaks = seq(range(events$times)[1], range(events$times)[2], by = 300))))
}
This results in variables a1 through a2000, each of which contains a table of how many occurrences of type i there were in each of the 5-minute buckets.
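For comparison, here is a minimal sketch of the same bucketing done in one vectorized pass instead of 2000 calls to cut. It is only an illustration under some assumptions: the toy events data frame and the names start, bucket, n_buckets and counts are mine, and the intervals here are closed on the left, whereas cut's default intervals are closed on the right (so cut drops the earliest event unless include.lowest = TRUE is set).

```r
# Toy stand-in for the real 5-million-row `events` (values taken from the sample above).
events <- data.frame(
  type  = c(1, 1, 2, 2, 2, 2),
  times = c(1352861760, 1362377700, 1365491820,
            1368216180, 1362088800, 1362377700)
)
types <- 1:2   # every possible type (1:2000 in the real data)
by    <- 300   # seconds in 5 minutes

# Integer bucket index for every event: 1 = first 5-minute interval, and so on.
# as.integer() keeps the factor labels below free of scientific notation.
start  <- min(events$times)
bucket <- as.integer((events$times - start) %/% by) + 1L
n_buckets <- max(bucket)

# One cross-tabulation: rows are buckets (empty ones included), columns are types.
counts <- table(
  bucket = factor(bucket, levels = seq_len(n_buckets)),
  type   = factor(events$type, levels = types)
)
```

Column i of counts then plays the role of a<i>. With roughly 150,000 buckets and 2000 types the dense table would have a few hundred million cells, so a sparse representation may be needed at full scale.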
I then compute all pairwise correlations among a1 through a2000.
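To make the correlation step concrete: if the per-type counts are held as the columns of one numeric matrix rather than as 2000 separate variables, a single call to cor returns every pairwise correlation. The small matrix below is made-up data for illustration only.

```r
# Hypothetical counts: 4 buckets (rows) by 3 types (columns).
counts <- matrix(
  c(0, 1, 2, 3,    # a1
    0, 2, 4, 6,    # a2 (exactly twice a1)
    3, 2, 1, 0),   # a3 (a1 reversed)
  ncol = 3,
  dimnames = list(NULL, c("a1", "a2", "a3"))
)

# Pearson correlation between every pair of columns, as a 3 x 3 matrix.
cors <- cor(counts)
```

A type whose count is the same in every bucket has zero variance, and cor reports NA (with a warning) for any pair involving it.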
Is there a way to optimize the chunk of code I provided above? It runs very slowly, yet I can't think of a way to make it faster. Perhaps there are just too many buckets and too little time.
Any insight would be much appreciated.
Reproducible example:
>head(events)
type times
12 1308575460
12 1308676680
12 1308825420
12 1309152660
12 1309879140
25 1309946460
library(xts)
xevents <- xts(events[,"type"],.POSIXct(events[,"times"]))
ep <- endpoints(xevents, "minutes", 5)
counts <- period.apply(xevents, ep, tabulate, nbins=length(types))
>head(counts)
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2011-06-20 09:11:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2011-06-21 13:18:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2011-06-23 06:37:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2011-06-27 01:31:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2011-07-05 11:19:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2011-07-06 06:01:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ep[1:20]
[1] 0 1 2 3 4 5 6 7 8 9 10 12 20 21 22 23 24 25 26 27
Above is the code I have been using, but the problem is that the result does not advance in 5-minute steps: it only advances with the occurrences of actual events, so 5-minute buckets that contain no events are skipped.
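One possible way to get the empty buckets back, sketched on made-up data: round each period's timestamp up to its 5-minute boundary with align.time, then merge against an empty xts object built on a regular 5-minute grid, filling the gaps with 0. This assumes the xts package is installed; the toy times and the names aligned, grid and padded are mine, not part of the code above.

```r
library(xts)

# Toy data: two events in one 5-minute interval, then a gap, then a third event.
times   <- c(1308575460, 1308575520, 1308576720)
type    <- c(12, 12, 25)
n_types <- 25

x      <- xts(type, .POSIXct(times, tz = "UTC"))
ep     <- endpoints(x, "minutes", 5)
counts <- period.apply(x, ep, tabulate, nbins = n_types)  # rows only for occupied intervals

# Move each row's timestamp up to the next 5-minute boundary (300 seconds).
aligned <- align.time(counts, n = 300)

# Regular 5-minute grid spanning the data, held in a zero-column xts object.
grid   <- seq(start(aligned), end(aligned), by = 300)
padded <- merge(aligned, xts(, grid), fill = 0)  # empty buckets become rows of zeros
```

After this, each row of padded is labelled with the end of its 5-minute bucket, and the buckets between the second and third events appear as all-zero rows.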