This code generates a dataset similar to my own:
df <- c(seq(as.Date("2012-01-01"), as.Date("2012-01-10"), "days"))
df <- as.data.frame(df)
df <- rbind(df, df)
id <- c(rep.int(1, 10), rep.int(2, 10))
id <- as.data.frame(id)
cnt <- c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
cnt <- as.data.frame(cnt)
df <- cbind(id, df, cnt)
names(df) <- c("id", "date", "cnt")
df$date[df$date == "2012-01-10"] <- "2012-01-20"
For each id, I'm trying to find the sum of the variable 'cnt' over the last 7 days. Dates are sometimes not continuous (see the last date for each id in the preceding 'df').
Here's the loop:
system.time(
  for (i in 1:length(df$date)) {
    df$cnt.weekly[i] <-
      sum(df$cnt[which((df$date == df$date[i] - 1) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 2) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 3) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 4) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 5) & df$id == df$id[i])],
          df$cnt[which((df$date == df$date[i] - 6) & df$id == df$id[i])])
  }
)
I'm ultimately running this on an 8 million row data.frame (thousands of ids), so while the loop is fast on this toy example, it is very slow in practice.
I've had very good luck with the data.table package in other parts of the code, but I can't figure out how to get it to work here. Maybe lapply inside of data.table?
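To make the target concrete, here is a minimal sketch of one direction that might work: a data.table non-equi self-join instead of lapply. It assumes the window should match the loop above (the 1 to 6 days before each date, excluding the date itself) and that each id has at most one row per date. The names dt, lo, hi and res are hypothetical helpers introduced only for this illustration, and I have not checked how it scales to 8 million rows.

```r
library(data.table)

# Same toy data as above, built directly as a data.table
dt <- data.table(
  id   = rep(1:2, each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), times = 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)

# Window bounds per row (assumption: mirrors the loop, days -6 through -1)
dt[, `:=`(lo = date - 6, hi = date - 1)]

# Non-equi self-join: for every row, sum cnt over rows of the same id
# whose date lies in [lo, hi]; rows with no match contribute NA, hence na.rm
res <- dt[dt,
          on = .(id, date >= lo, date <= hi),
          .(cnt.weekly = sum(x.cnt, na.rm = TRUE)),
          by = .EACHI]

# by = .EACHI returns one row per row of the inner dt, in the same order
dt[, cnt.weekly := res$cnt.weekly]
dt[, c("lo", "hi") := NULL]
```

If the current day should count toward the 7 days, hi would become date rather than date - 1.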
Thanks in advance!