I am working on a dataframe with 200,000+ rows and many columns. Let's take a 10,000-row dummy version of it as a sample, df:
set.seed(1)
timeslot  <- as.integer(abs(runif(10000, min = 1, max = 1000)))
ID        <- LETTERS[abs(as.integer(rnorm(10000, 2)^3)) %% 9 + 1]
variable1 <- as.integer(rnorm(10000, 2)^3)
df <- data.frame(timeslot, ID, variable1)
df <- df[order(df$timeslot, df$ID), ]
I also calculate a column, min1, that checks whether the ID of each row is also present somewhere in the previous timeslot:
df$min1 <- sapply(seq(nrow(df)), function(x) {
  if (df[x, "timeslot"] == 1) {
    0
  } else {
    max(df[x, "ID"] %in% df[df$timeslot == df[x, "timeslot"] - 1, "ID"])
  }
})
This all goes quite well and delivers the following head(df)/tail(df):
timeslot ID variable1 min1
4919 1 A 15 0
2329 1 C 48 0
7359 1 C 1 0
1978 1 E 6 0
2883 1 F 7 0
7448 1 F 21 0
-------------------------------
8462 998 F 1 1
1724 998 H 2 0
989 999 A 7 1
2589 999 D 12 1
3473 999 D 0 1
780 999 I 5 0
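As an aside, I think the same min1 column could also be built in a single vectorized step, by checking for every row whether its (timeslot - 1, ID) key occurs among the (timeslot, ID) keys present in the data. A sketch (it assumes the pasted keys cannot collide, which holds here because ID is a single letter):

# for each row: does its ID occur anywhere in the previous timeslot?
df$min1 <- as.integer(paste(df$timeslot - 1, df$ID) %in% paste(df$timeslot, df$ID))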
I want to perform some calculations on variable1, grouped by unique timeslot+ID. One of these calculations is funfac:
funfac <- function(x, y) {
  total <- 0
  for (i in x) total <- total + i ^ y
  abs(total / length(x)) ^ (1 / y)
}
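Since ^ is vectorized in R, I believe the same power mean can be written without the explicit loop; a sketch (funfac_vec is just my own shorthand name):

# vectorized equivalent of funfac: the (absolute) power mean of x with exponent y
funfac_vec <- function(x, y) {
  abs(mean(x ^ y)) ^ (1 / y)
}
# e.g. funfac_vec(c(2, 4, 6), 5) should give the same value as funfac(c(2, 4, 6), 5)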
However, now comes the difficult part: for each ID in a given timeslot I want to do the calculation over all rows with that same ID in that timeslot and the previous one. So if timeslot '2' contains 3x D and timeslot '1' contains 2x D, the calculation should run over all 5 Ds. My min1 column tells me whether an ID is present in the previous timeslot; if it is not, the calculation should return NA.
First I did this with the following code:
lp5 = c()
for (j in 1:nrow(df)){
  if (df[j, "min1"] == 0){
    lp5 = c(lp5, NA)
  } else {
    total = 0
    x = df[which((df[, "timeslot"] == df[j, "timeslot"] | df[, "timeslot"] == (df[j, "timeslot"] - 1)) & df[, "ID"] == (df[j, "ID"])), "variable1"]
    for (i in x){
      i = i ** 5
      total <- total + i
    }
    lp5 = c(lp5, ((abs(total / (length(x)))) ** (1 / 5)))
  }
}
tempdf = data.frame(df[,"timeslot"],df[,"ID"], lp5)
lp5 = tempdf[!duplicated(tempdf[,1:2]),][,3]
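A side thought: I suspect part of the slowness of this first version comes from growing lp5 with c() inside the loop, which copies the whole vector on every iteration. A minimal sketch of the same loop with a preallocated result (lp5_pre is a placeholder name; the vectorized mean replaces the inner total loop and should give the same numbers):

# same per-row lp5 as above, but written into a preallocated vector
lp5_pre <- rep(NA_real_, nrow(df))
for (j in 1:nrow(df)) {
  if (df[j, "min1"] == 1) {
    x <- df[df$timeslot %in% c(df[j, "timeslot"], df[j, "timeslot"] - 1) &
              df$ID == df[j, "ID"], "variable1"]
    lp5_pre[j] <- abs(mean(x^5))^(1/5)   # equals the total/length(x) loop above
  }
}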
I also figured that I was doing a lot of the calculations twice, so I thought: why not check whether a calculation has already been done? I do this by adding each unique timeslot+ID combination to a dataframe together with its calculated value, and checking on every iteration whether that combination is already in the dataframe.
lp5DF = data.frame("timeslot" = numeric(0), "ID" = character(0), "lp5" = numeric(0))
for (j in 1:nrow(df)){
  if (duplicated(rbind(lp5DF[, 1:2], data.frame(timeslot = df[j, "timeslot"], ID = df[j, "ID"])))[nrow(lp5DF) + 1]) {next} else {
    if (df[j, "min1"] == 0){
      lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j, "timeslot"], "ID" = df[j, "ID"], "lp5" = NA))
    } else {
      total = 0
      x = df[which((df[, "timeslot"] == df[j, "timeslot"] | df[, "timeslot"] == (df[j, "timeslot"] - 1)) & df[, "ID"] == (df[j, "ID"])), "variable1"]
      for (i in x){
        i <- i ** 5
        total <- total + i
      }
      lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j, "timeslot"], "ID" = df[j, "ID"], "lp5" = ((abs(total / (length(x)))) ** (1 / 5))))
    }
  }
}
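Building on the same idea, I suspect the rbind() plus duplicated() on a growing dataframe inside this loop is what hurts most, since everything accumulated so far gets copied and rescanned on every iteration. A sketch that does the duplicate check once up front and visits each timeslot+ID pair exactly once (keep and lp5DF2 are placeholder names, not verified at scale):

# one duplicated() call up front; each unique timeslot+ID pair is computed once
keep   <- which(!duplicated(df[, c("timeslot", "ID")]))
lp5DF2 <- data.frame(timeslot = df$timeslot[keep], ID = df$ID[keep], lp5 = NA_real_)
for (k in seq_along(keep)) {
  j <- keep[k]
  if (df[j, "min1"] == 1) {
    x <- df[df$timeslot %in% c(df[j, "timeslot"], df[j, "timeslot"] - 1) &
              df$ID == df[j, "ID"], "variable1"]
    lp5DF2$lp5[k] <- abs(mean(x^5))^(1/5)
  }
}

In any case, the rbind-based version above gives the result I am after.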
The output (head/tail) of lp5DF will be:
timeslot ID lp5
1 1 A NA
2 1 B NA
3 1 C NA
4 1 D NA
5 1 E NA
6 1 F NA
-------------------------
7738 999 B 14.83423
7739 999 C 14.80149
7740 999 E NA
7741 999 F 49.48538
7742 999 G 23.05222
7743 999 H NA
and lp5DF[, 3] == lp5 shows that both approaches return the same values.
However, this duplicate check turned out to make things a lot slower (6.5x in my case). Since I have to run this kind of calculation several times over a lot of rows (the dataframe may be expanded later in the project), both of my approaches are too slow. Why is the second one so much slower, and is there a way to speed this up? Maybe something with lapply or the dplyr package?
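For what it's worth, this is the rough dplyr direction I had in mind but have not verified against the full data. It assumes lp5 for a timeslot+ID pair can be rebuilt from each group's sum of variable1^5 and group size, combined with those of the previous timeslot (grp, prev and lp5_dplyr are placeholder names):

library(dplyr)

# per-group sum of variable1^5 and group size, one row per timeslot+ID pair
grp <- df %>%
  group_by(timeslot, ID) %>%
  summarise(s = sum(variable1^5), n = n(), min1 = first(min1), .groups = "drop")

# the same summaries shifted one timeslot forward, so after the join they
# describe the previous timeslot of each current row
prev <- grp %>%
  transmute(timeslot = timeslot + 1, ID, s_prev = s, n_prev = n)

lp5_dplyr <- grp %>%
  left_join(prev, by = c("timeslot", "ID")) %>%
  mutate(lp5 = ifelse(min1 == 0, NA_real_,
                      abs((s + s_prev) / (n + n_prev))^(1/5))) %>%
  select(timeslot, ID, lp5)

Would something along these lines be the right direction, or is there a better way?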