Why is a calculating for every row faster than checking if calculation has been done?

Question

I was working on a dataframe with 200.000+ rows and many columns. Let's take a sample dummy version as such that df :

set.seed(1)
"timeslot" = c(as.integer(abs(runif(10000,min=1,max=1000))))
"ID" = c(LETTERS[abs(as.integer(rnorm(10000,2)**3))%%9+1])
"variable1" = c(as.integer(rnorm(10000,2)**3))
df = data.frame(timeslot,ID,variable1)
df = df[order(df$timeslot, df$ID),]

I also calculate a column to check if the ID of that row is also present somewhere in the previous timeslot, called min1:

df$min1 <- sapply(seq(nrow(df)), function(x)
{
  if(df[x, "timeslot"] == 1){0} else {
    max(df[x, "ID"] %in% df[df$timeslot == df[x,"timeslot"] - 1,"ID"])}
})

This all goes quite well and delivers the following head(df)/tail(df):

     timeslot ID variable1 min1
4919        1  A        15    0
2329        1  C        48    0
7359        1  C         1    0
1978        1  E         6    0
2883        1  F         7    0
7448        1  F        21    0
-------------------------------
8462      998  F         1    1
1724      998  H         2    0
989       999  A         7    1
2589      999  D        12    1
3473      999  D         0    1
780       999  I         5    0

I want to perform some calculations on variable1, grouped by unique timeslot+ID. One of these calculations is funfac:

total=0
funfac <- function(x,y){  for (i in x){ (i <- i ** y);
total <- total + i};return((abs(total/(length(x))))**(1/y));total=0 }

However, now comes the difficult part: per ID in a specific timeslot I want to do a calculation over all same IDs in that timeslot and the previous timeslot. So if in timeslot '2' there are 3x D, and in timeslot '1' there are 2x D, the calculation should be done over all 5 Ds. My column min1 helps identify if that ID is present in the previous timeslot. If not: the calculation should return a NA.

First I did this with the following code:

lp5 = c() 
for (j in 1:nrow(df)){
  if (df[j,"min1"] == 0){lp5 = c(lp5,NA)} else {
    total = 0
    x = df[which((df[,"timeslot"] == df[j,"timeslot"] | df[,"timeslot"] == (df[j,"timeslot"]-1)) & df[,"ID"]==(df[j,"ID"])),"variable1"]
    for (i in x){
      i = (i ** 5);
      total <- total + i
    }
    lp5 = c(lp5,((abs(total/(length(x))))**(1/(5))))
  }
}
tempdf = data.frame(df[,"timeslot"],df[,"ID"], lp5)
lp5 = tempdf[!duplicated(tempdf[,1:2]),][,3]

Figuring that I performed a lot of calculations double, I thought: Why not check if the calculation has been done already. Doing so by adding the unique timeframe+ID in a dataframe, including the calculated value. And each time checking if the value is in the dataframe already.

lp5DF = data.frame("timeslot" = numeric(0), "ID" = character(0), "lp5" = numeric(0))
for (j in 1:nrow(df)){
  if (duplicated(rbind(lp5DF[,1:2],data.frame(timeslot=df[j,"timeslot"], ID=df[j,"ID"])))[nrow(lp5DF)+1]) {next} else{
    if (df[j,"min1"] == 0){lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j,"timeslot"], "ID" = df[j,"ID"], "lp5" = NA))} else {
      total = 0
      x = df[which((df[,"timeslot"] == df[j,"timeslot"] | df[,"timeslot"] == (df[j,"timeslot"]-1)) & df[,"ID"]==(df[j,"ID"])),"variable1"]
      for (i in x){
        (i <- i ** 5);total <- total + i
      }
      lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j,"timeslot"], "ID" = df[j,"ID"], "lp5" = ((abs(total/(length(x))))**(1/5))))               }
  }
}

The output (head/tail) of lp5DF will be:

  timeslot ID lp5
1        1  A  NA
2        1  B  NA
3        1  C  NA
4        1  D  NA
5        1  E  NA
6        1  F  NA
-------------------------
7738      999  B 14.83423
7739      999  C 14.80149
7740      999  E       NA
7741      999  F 49.48538
7742      999  G 23.05222
7743      999  H       NA

and: lp5DF[,3]==lp5

However, checking this seemed to be a lot slower (6.5x in my case). Since I have to run this kind of calculation multiple times over a lot of rows (dataframe may be expanded later in the project) both my ways are too slow. Why is the second one so slow, and is there a way to speed this up? Maybe something with lapply or the dplyr package?

`rbind.data.frame` is extremely expensive for performance, even more so than the bad practice of growing a vector with `c`. Learn to pre-allocate or preferably use vectorized operations instead of loops. — Roland, Dec 07 '16 at 13:54
I just started programming last week, so sorry for my bad (programming) grammar. Do you have any documentation on how to pre-allocate or using vectorized operations? — Paul, Dec 07 '16 at 14:00
One example of preallocation. Change `lp5 = c()` to `lp5 = numeric(nrow(df))`. This sets up a numeric vector of the proper size in memory prior to the loop. The vector can then be filled with `lp5[i] = c(lp5,((abs(total/(length(x))))**(1/(5))))` in the `else` block for example. — lmo, Dec 07 '16 at 14:31
A really good post to [read](http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r)... — Christoph, Dec 07 '16 at 14:45

score 2 · Accepted Answer · edited May 23 '17 at 12:26

2

There are just a lot to optimize. Try learning data manipulation packages like dplyr, data.table.

min1 can be calculated using the technique from here

library(dplyr)
dfs <- split(df$ID, df$timeslot)
df$min1 <- unlist(mapply(`%in%`, dfs,  lag(dfs)))

lp5 is little tricky, but manageable

df1 <- df %>% 
  group_by(timeslot, ID) %>% 
  summarise(min1 = all(min1), s = sum(variable1^5), n = n()) %>% 
  group_by(ID) %>% 
  mutate(s1 = s + lag(s), n1 = n + lag(n), lp5 = ifelse(min1, abs((s1/n1)^(1/5)), NA)) 
lp5 <- df1$lp5

data.table equivalent is

library(data.table)
setDT(df)
dt1 <- df[, .(min1 = all(min1), s = sum(variable1^5), n = .N), by=.(timeslot, ID)]
dt1[, `:=`(s1 = s + shift(s), n1 = n + shift(n)), by=ID]
dt1[min1==TRUE, lp5 := abs((s1/n1)^(1/5)), by=ID]
lp5 <- dt1$lp5

edited May 23 '17 at 12:26

Community

1
1

answered Dec 07 '16 at 20:55

ExperimenteR

4,453
1
15
19

Thank you, I will try it first thing in the morning! – Paul Dec 08 '16 at 17:05
It works! And if I understand the formula correctly, if I wanted to take a timeslot which is two ahead of the current one I would use something like: 'lead(dfs, 2)'? – Paul Dec 09 '16 at 08:31
1

It works, however I see that sometimes the formula returns a NA when min1 is False, and sometimes the value zero. This should be NA always. Should I put an ifelse statement in the mutate function? `lp5 = ifelse(min1==F, NA, abs((s1/n1)^(1/5))*min1)` (the *min1 seems superfluous to me then, since it will always be 1) I tested the solution and it still works, and is still fast. However maybe again I'm falling back to my bad habits of using ifelse/for-loops/etc. – Paul Dec 09 '16 at 09:38
1

Nothing is bad about ifelse and for loop, if used properly :) – ExperimenteR Dec 09 '16 at 09:44

Why is a calculating for every row faster than checking if calculation has been done?

1 Answers1