
I wrote a function with a simple for loop in R. After a while, someone suggested another way to do the same thing with dplyr. So I tried it, and I saw a big difference in the time my script took to run (about one second faster!). I'm wondering where this huge difference in run time comes from. Is dplyr just far more optimized? Is dplyr compiled in some way that speeds up the process? I don't know.

My original function:

key.rythm <- function(key, data) {
  ## An empty data frame to hold the results
  results <-
    data.frame(
      "down.time" = numeric(),
      "duration" = numeric(),
      "touche" = factor()
    )
  down.time <- NULL
  
  ## We have to use a loop to parse the data row by row
  for (i in 1:nrow(data)) {
    
    if (data[i, "K.TOUCHE"] != key)
      next
    
    ## For the right key, if we encounter a down, store it
    ## (that way, if we encounter two downs in a row with no up in between,
    ## the first one is overwritten and only the second one counts)
    if (data$K.EVENEMENT[i] == "Key Down") {
      down.time <- data$K.TEMPS[i]
      
    } else {
      
      ## Check that we actually saw a down beforehand
      if (is.null(down.time)) {
        duration <- NA
        down.time <- NA
      } else {
        ## Compute the duration between down and up
        duration <- data$K.TEMPS[i] - down.time
      }
      
      ligne <- c(down.time, duration)
      results <- rbind(results, ligne)
      ## Clear the down (in case of two consecutive ups, just in case)
      down.time <- NULL
    }
    
  }
  
  ## 0 is treated as FALSE: only assign if there are rows
  if (nrow(results)) {
    results$touche <- key
  }
  names(results) <- c("down.time", "duration", "touche")
  return(results)
}

and the dplyr way:

tmp <- group_by(filter(data, K.EVENEMENT == "Key Up"), K.TOUCHE)$K.TEMPS -
  group_by(filter(data, K.EVENEMENT == "Key Down"), K.TOUCHE)$K.TEMPS
Galyfray

2 Answers


For sure, you should never write your own row-by-row loop over a data.frame. There are a lot of packages and functions you can use to manipulate data in R.

I see that you are only at the beginning of your R journey. It is a wonderful adventure, my friend.
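
As a tiny illustration of the point (toy data and my own example, not the poster's code), the same work done with an explicit loop and with one vectorized call:

# Hypothetical toy data, purely for illustration
df <- data.frame(x = 1:5)

# Loop version: grows a vector one element at a time
res <- numeric(0)
for (i in 1:nrow(df)) res <- c(res, df$x[i] * 2)

# Vectorized version: one call, no loop, no growing object
res2 <- df$x * 2

identical(res, res2)
#> [1] TRUE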

John Smith

This is not really a full answer but more of an extended comment. Disclaimer: I use dplyr etc. a lot for data manipulation.

I noticed you are iterating through each item in your column and slowly appending the results to a growing object. This is problematic: growing an object inside a loop forces repeated copying, and the loop fails to vectorize.
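
To make the growing-object point concrete, here is a minimal sketch (toy sizes and hypothetical function names of my own) contrasting growing a result with rbind against preallocating it:

# Growing: each rbind copies the whole data frame, so the cost is roughly quadratic
grow <- function(n) {
  res <- data.frame(x = numeric())
  for (i in 1:n) res <- rbind(res, data.frame(x = i))
  res
}

# Preallocating: fill a vector of known length, then build the data frame once
prealloc <- function(n) {
  x <- numeric(n)
  for (i in 1:n) x[i] <- i
  data.frame(x = x)
}

system.time(grow(5000))      # noticeably slower
system.time(prealloc(5000))  # near-instant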

I am not very sure what the intended output of your code is, so I am making a guess below based on your dplyr function. Consider the following, which implements the same result using base R and dplyr:

library(microbenchmark)
library(dplyr)
set.seed(111)

data = data.frame(K.EVENEMENT = rep(c("Key Up", "Key Down"), each = 500),
                  K.TEMPS = rnorm(1000),
                  K.TOUCHE = rep(letters[1:2], 500))
data$K.EVENEMENT = factor(data$K.EVENEMENT, levels = c("Key Up", "Key Down"))

dplyr_f = function(data){
  group_by(filter(data, K.EVENEMENT == "Key Up"), K.TOUCHE)$K.TEMPS -
    group_by(filter(data, K.EVENEMENT == "Key Down"), K.TOUCHE)$K.TEMPS
}

spl_red = function(data) Reduce("-", split(data$K.TEMPS, data$K.EVENEMENT))
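
For anyone unfamiliar with that base R idiom, here is a tiny sketch (toy values of my own, not the benchmark data) of what split and Reduce are doing:

x <- c(5, 7, 2, 3)
f <- factor(c("Key Up", "Key Up", "Key Down", "Key Down"),
            levels = c("Key Up", "Key Down"))

## split() groups a vector by a factor into a named list, in level order
split(x, f)
#> $`Key Up`
#> [1] 5 7
#> $`Key Down`
#> [1] 2 3

## Reduce("-", list(a, b)) computes a - b element-wise
Reduce("-", split(x, f))
#> [1] 3 4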

Looking at your dplyr function, the grouping by K.TOUCHE is essentially useless here, because extracting $K.TEMPS right away ignores the groups: it neither reorders nor changes anything. So we can simplify the function to:

dplyr_nu = function(data){
  filter(data, K.EVENEMENT == "Key Up")$K.TEMPS -
    filter(data, K.EVENEMENT == "Key Down")$K.TEMPS
}

all.equal(dplyr_nu(data), dplyr_f(data))
[1] TRUE
all.equal(dplyr_nu(data), spl_red(data))
[1] TRUE

We can look at the speed:

microbenchmark(dplyr_f(data),dplyr_nu(data),spl_red(data))

           expr      min        lq       mean    median        uq      max neval cld
  dplyr_f(data) 1466.180 1560.4510 1740.33763 1636.9685 1864.2175 2897.748   100   c
 dplyr_nu(data)  812.984  862.0530  996.36581  898.6775 1051.7215 4561.831   100  b 
  spl_red(data)   30.941   41.2335   66.42083   46.8800   53.0955 1867.247   100 a  

I would think your original function can be simplified with some ordering or a simple split and reduce, as sketched below. Maybe there is a more sophisticated use for dplyr downstream; the above is just for healthy discussion.
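
As a rough sketch of that idea (assuming, like the original loop, that rows are already ordered by time; key_rythm_dplyr is a hypothetical name of mine and tidyr::fill is my own choice, not the poster's code), a grouped dplyr version of key.rythm could look like:

library(dplyr)

key_rythm_dplyr <- function(data) {
  data %>%
    group_by(K.TOUCHE) %>%
    ## remember the time of the most recent Key Down within each key
    mutate(down.time = if_else(K.EVENEMENT == "Key Down", K.TEMPS, NA_real_)) %>%
    tidyr::fill(down.time) %>%
    ungroup() %>%
    ## keep only the ups and compute the down-to-up duration
    filter(K.EVENEMENT == "Key Up") %>%
    transmute(down.time,
              duration = K.TEMPS - down.time,
              touche = K.TOUCHE)
}

Grouping by K.TOUCHE here handles the case raised in the comment below: two keys can be down at the same time, and each up must be paired with the last down of the same key. Note this is only a sketch, not a drop-in replacement: unlike the loop, a second consecutive Key Up would reuse the same down time instead of getting NA.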

StupidWolf
  • the group_by is useful because in some cases there are two key downs on different keys, and not grouping will cause problems – Galyfray Apr 11 '20 at 14:38