EDIT:
I edited the question to include some more information and why I can't just use a simple sum/mean function:
I need to calculate the sum for a specific subset of rows within a certain timeframe and got this working fine with a for loop, however my dataset is huge and the for loop not very efficient, so I am trying to use the apply function instead, but can't get it to work.
Here is a working example for a simple dataset. What I need is to calculate the sum of each category but only from any category with that value in the last hour and add that as a new column to this dataset. I am working with unix to make it easier and just work with numbers.
(Note: In reality I want to know the average delay for all trains that went between station A and station B in the last hour - just to bring it into context)
#Create basic dataframe:
category = c("a", "b", "a", "c", "a")
value = c(1:5)
time = c(1444305900,1444306587, 1444305600, 1444291200, 1444291900 )
sum = NA
d = data.frame(category,time,value, sum, stringsAsFactors = FALSE)
This returns:
> d
category time value
1 a 1444305900 1
2 b 1444306587 2
3 a 1444305600 3
4 c 1444291200 4
5 a 1444291900 5
Note that the first a
was at 12.05pm, the second at 12.00pm and the third at 8am. So when I want to get the mean for the previous hour, the first a
had no other a
in the previous hour, whereas the second a
has itself and another a
5 minutes earlier and therefore the sum needs to be calculated from the first and second a
.
Then I run a for loop to calculate the sum for each category and add it into the previously created sum column:
#Run it as a loop:
for (i in 1:nrow(d)){
pickcategory = d[i,c("category")] #here I select my categroy that I want to filter on
pickunix = d[i,c("time")]
filterrows = filter(d, grepl(pickcategory,category)) #here I am subsetting the entire dataframe for only those rows containing this category
filterhour = filterrows[filterrows$time <= pickunix & filterrows$time > (pickunix-3600),] #subset for previous hour
getsum = sum(filterhour$value) #get the mean value for that category
d$sum[i] = getsum #add that mean value to that row
}
This provides my desired output:
category time value sum
1 a 1444305900 1 4
2 b 1444306587 2 2
3 a 1444305600 3 3
4 c 1444291200 4 4
5 a 1444291900 5 5
BUT: This is slow for 100 Million rows, so I tried apply but I do not get it work. All tutorials around apply are very simple ones where people don't use custom written function, so I can't figure it out.
It already goes wrong in the first line with pickcategory
as it select the whole column rather than the value for that row that I want to run the function on.
testfunction= function(d){
pickcategory = d$category #here I select my categroy that I want to filter on
pickunix = d$time
filterrows = filter(d, grepl(pickcategory,category)) #here I am subsetting the entire dataframe for only those rows containing this category
filterhour = filterrows[filterrows$time <= pickunix & filterrows$time > (pickunix-3600),] #subset for previous hour
getsum = sum(filterhour$value) #get the mean value for that category
d$sum = getsum #add that mean value to that row
return(d)
}
output = apply(d, 1, function(x) testfunction(d))
Can anyone tell me how to make that for loop into an efficient apply function?
Please note that my actual example doesn't calculate the sum, but something more complicated, so this needs to work for any sort of calculation and category selection I want to do.
Any help would be really appreciated.