R Using apply to calculate some stats for specific rows in a large dataframe within a certain timeframe

Question

EDIT:

I edited the question to include some more information and why I can't just use a simple sum/mean function:

I need to calculate the sum for a specific subset of rows within a certain timeframe and got this working fine with a for loop, however my dataset is huge and the for loop not very efficient, so I am trying to use the apply function instead, but can't get it to work.

Here is a working example for a simple dataset. What I need is to calculate the sum of each category but only from any category with that value in the last hour and add that as a new column to this dataset. I am working with unix to make it easier and just work with numbers.

(Note: In reality I want to know the average delay for all trains that went between station A and station B in the last hour - just to bring it into context)

#Create basic dataframe: 
category = c("a", "b", "a", "c", "a")
value = c(1:5)
time = c(1444305900,1444306587, 1444305600, 1444291200, 1444291900 )
sum = NA
d = data.frame(category,time,value, sum, stringsAsFactors = FALSE)
This returns:
> d
  category       time value
1        a 1444305900     1
2        b 1444306587     2
3        a 1444305600     3
4        c 1444291200     4
5        a 1444291900     5

Note that the first a was at 12.05pm, the second at 12.00pm and the third at 8am. So when I want to get the mean for the previous hour, the first a had no other a in the previous hour, whereas the second a has itself and another a 5 minutes earlier and therefore the sum needs to be calculated from the first and second a.

Then I run a for loop to calculate the sum for each category and add it into the previously created sum column:

#Run it as a loop: 
for (i in 1:nrow(d)){
  pickcategory = d[i,c("category")]   #here I select my categroy that I want to filter on
  pickunix = d[i,c("time")]
  filterrows = filter(d, grepl(pickcategory,category)) #here I am subsetting the entire dataframe for only those rows containing this category
  filterhour = filterrows[filterrows$time <= pickunix & filterrows$time > (pickunix-3600),] #subset for previous hour
  getsum = sum(filterhour$value)  #get the mean value for that category
  d$sum[i] = getsum  #add that mean value to that row
}

This provides my desired output:

  category       time value sum
1        a 1444305900     1   4
2        b 1444306587     2   2
3        a 1444305600     3   3
4        c 1444291200     4   4
5        a 1444291900     5   5

BUT: This is slow for 100 Million rows, so I tried apply but I do not get it work. All tutorials around apply are very simple ones where people don't use custom written function, so I can't figure it out.

It already goes wrong in the first line with pickcategory as it select the whole column rather than the value for that row that I want to run the function on.

testfunction= function(d){
  pickcategory = d$category   #here I select my categroy that I want to filter on
  pickunix = d$time
  filterrows = filter(d, grepl(pickcategory,category)) #here I am subsetting the entire dataframe for only those rows containing this category
  filterhour = filterrows[filterrows$time <= pickunix & filterrows$time > (pickunix-3600),] #subset for previous hour
  getsum = sum(filterhour$value)  #get the mean value for that category
  d$sum = getsum #add that mean value to that row
  return(d)
  }

output = apply(d, 1, function(x) testfunction(d))

Can anyone tell me how to make that for loop into an efficient apply function?

Please note that my actual example doesn't calculate the sum, but something more complicated, so this needs to work for any sort of calculation and category selection I want to do.

Any help would be really appreciated.

Try filtering your `d$sum` when you fill it, something along the line `d$sum[d$category == pickcategory] = getsum` should do — Tensibai, Mar 18 '16 at 10:43
This is just a simple "sum/mean by group" question. If you need it very fast, try `data.table`. Create your data set without predefining `sum`, by just using `d <- data.frame(category,value)`. And then simply `library(data.table) ; setDT(d)[, Sum := sum(value), by = category]` and you all set. If you want the mean instead of sum just do `setDT(d)[, Mean := mean(value), by = category]`. Or as it appears you are using dplyr here, it could done with `d %>% group_by(category) %>% mutate(Sum = sum(value))` — David Arenburg, Mar 18 '16 at 10:47
Side note to avoid removing factors by hand, add `stringsAsFactors=FALSE` in your `data.frame` call. — Tensibai, Mar 18 '16 at 10:51
btw: giving a column the same name as a function is not wise — Jaap, Mar 18 '16 at 11:14

R Using apply to calculate some stats for specific rows in a large dataframe within a certain timeframe

0 Answers0