Making my code more efficient in R

Question

I am trying to run code that takes far too long (more than 6 days). Is there a way to make it more efficient? Any ideas?

library(haven)
library(plyr)
AFILIAD1 <- read_sav("XXXX")
#this sav has around 6 million rows.

AFILIAD1$F_ALTA<- as.character(AFILIAD1$F_ALTA)
AFILIAD1$F_BAJA<- as.character(AFILIAD1$F_BAJA)


AFILIAD1$F_ALTA <- as.Date(AFILIAD1$F_ALTA, "%Y%m%d")
AFILIAD1$F_BAJA <- as.Date(AFILIAD1$F_BAJA, "%Y%m%d")
#starting and ending date

meses <- seq(as.Date("1900-01-01"), as.Date("2014-12-31"), by = "month")

#this is the function that needs to be more efficient 
ocupados <- function(pruebas){
 previo <- c()
 total <- c()
   for( i in 1:length(meses)){
     for( j in 1:nrow(pruebas)){
       ifelse(pruebas$F_ALTA[j] <= meses[i]  & pruebas$F_BAJA[j] >= 
       meses[i], previo[j]<- pruebas$IPF[j],previo[j]<- NA)
      }
    total[i] <- (length(unique(previo))-1)
   }
  names(total)<-meses
  return(total)
}

#this takes >6 days to execute
afiliado1 <- ocupados(AFILIAD1)

Care to elaborate on what your code actually does? Or do you expect us to work that out ourselves? Without any sample data? And without expected output? Please review [how to ask](https://stackoverflow.com/help/how-to-ask) questions, and then provide a [minimal reproducible example/attempt](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including sample data. — Maurits Evers, Apr 22 '18 at 22:04

Melissa Key · Accepted Answer · 2018-04-22T23:36:51.863

3

There is a lot you can do to speed this up. Here's one example:

library(tidyverse) # loads purrr (map_int), dplyr (n_distinct) and the %>% pipe
ocupados <- function(pruebas) {
  total <- map_int(meses, function(x) {
    with(pruebas, {
      IPF[F_ALTA <= x & F_BAJA >= x] %>%
        n_distinct() #I'm assuming you subtract 1 to remove the NA effect - no longer needed
    })
  })
  names(total) <- meses
  return(total)
}

There are two big speed-ups here. First, the inner loop over rows is replaced by vectorized comparisons, which run in compiled code (so you don't see a loop here). That will be a huge saving for you.
Second, we never grow vectors that start out empty. Such a vector can end up being copied every time its length increases, which is very expensive. Instead, all I'm saving is the final result. map_int, like the apply family of functions, behaves like a loop, but takes the loop body as a function and collects the results for you.
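
As a rough illustration of those two points, here is a minimal base R sketch on made-up data. The vectors `alta`, `baja`, `ipf` and the date `fecha` are hypothetical stand-ins for the columns in the question, not the real data. It counts the distinct IDs active on one date, first with an element-by-element loop that grows a vector, then with a single vectorized comparison.

```r
# Toy data (hypothetical values, not taken from the question)
set.seed(1)
alta  <- as.Date("2000-01-01") + sample(0:3000, 1000, replace = TRUE)
baja  <- alta + sample(0:1000, 1000, replace = TRUE)
ipf   <- sample(1:200, 1000, replace = TRUE)
fecha <- as.Date("2005-06-01")

# Slow style: test one row at a time and grow `activos` as we go
activos <- c()
for (j in seq_along(ipf)) {
  if (alta[j] <= fecha && baja[j] >= fecha) {
    activos[j] <- ipf[j]
  } else {
    activos[j] <- NA
  }
}
n_loop <- length(unique(activos[!is.na(activos)]))

# Fast style: compare every row at once, then subset and count
n_vec <- length(unique(ipf[alta <= fecha & baja >= fecha]))

# Both approaches give the same count
stopifnot(identical(n_loop, n_vec))
```

The second form does the same work, but the row-by-row comparison happens inside R's compiled internals rather than in interpreted R code.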

If you're not familiar with the pipe operator (%>%), all it does is call the next function with the result from the previous function as its first argument. So

length(unique(x))

is the same as

x %>%
  unique() %>%
  length()

The advantage is readability - it's easier to see that I apply unique, then length using the pipe.

One more comment - without a reproducible example, I cannot test this code. If you have trouble, you need to include a small reproducible data set so we can actually test what the code is doing.
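
For what it's worth, here is the kind of small data set I have in mind, together with a base R sketch of the same idea that can be checked by hand. Everything here is an assumption for illustration: the four rows are invented, `ocupados_base` is a hypothetical name, and `meses` is passed as an argument instead of being read from the global environment. Rows with missing dates are not treated specially.

```r
# Invented toy data with the same column names as in the question
pruebas <- data.frame(
  IPF    = c(1, 1, 2, 3),
  F_ALTA = as.Date(c("2014-01-15", "2014-03-10", "2014-02-01", "2014-02-20")),
  F_BAJA = as.Date(c("2014-02-15", "2014-04-30", "2014-03-15", "2014-02-25"))
)
meses <- seq(as.Date("2014-01-01"), as.Date("2014-05-01"), by = "month")

# Base R sketch: for each month start, count distinct IPF values whose
# [F_ALTA, F_BAJA] interval contains that date
ocupados_base <- function(pruebas, meses) {
  total <- vapply(seq_along(meses), function(i) {
    activo <- pruebas$F_ALTA <= meses[i] & pruebas$F_BAJA >= meses[i]
    length(unique(pruebas$IPF[activo]))
  }, integer(1))
  names(total) <- format(meses)
  total
}

res <- ocupados_base(pruebas, meses)
print(res)
# For this toy data the counts are 0 2 1 1 0
```

With an example like this posted in the question, anyone can compare the output of the original loop against a faster version before running it on the 6 million rows.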

edited Apr 22 '18 at 23:36

answered Apr 22 '18 at 23:12

Melissa Key


if using `tidyverse` you could also just use `n_distinct()` – zacdav Apr 22 '18 at 23:17
good point! I don't have all the commands in there memorized yet - there are a ton! Changed! – Melissa Key Apr 22 '18 at 23:19
Also `sapply(meses, function(mese) ...)` so that one does not need to index with `i` or set `names<-()`; `vapply()` is usually more robust than `sapply()`. – Martin Morgan Apr 22 '18 at 23:28
Changes added, Martin Morgan. I don't usually use `vapply` (actually, I usually use `purrr` these days, but I figured I'd keep this simpler). Maybe I should just change to `map_int` so there isn't that 0 at the end (which I still find to be weird). – Melissa Key Apr 22 '18 at 23:39
Melissa, you did the trick! One sample that I had for testing (300 rows) took 52 seconds with my code. With your code it took 0.32 seconds. I have another question open which has not been answered and which I think is pretty difficult, but maybe you want to try (not sure if linking to other questions may be against the rules, but it is really causing me a headache). It is somehow related to this topic. https://stackoverflow.com/questions/49949500/filling-missing-dates-in-r – Juan Carbonell Apr 23 '18 at 00:04
Good. The speedup should be greater the bigger your example, due to the empty vectors. I'd be curious to know how long it takes to do the whole thing - it won't take anywhere near 6 days! – Melissa Key Apr 23 '18 at 00:28
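
To illustrate the `sapply()` versus `vapply()` point from the comments, here is a small base R sketch. The names `contar` and `meses_vacio` are made up for the example. It shows two cases where `sapply()` quietly changes its return type while `vapply()` either keeps the declared type or stops with an error.

```r
# Placeholder function standing in for the per-month count
contar <- function(x) 1L

# Case 1: empty input
meses_vacio <- as.Date(character(0))
r_sapply <- sapply(meses_vacio, contar)              # falls back to an empty list
r_vapply <- vapply(meses_vacio, contar, integer(1))  # stays an integer vector

# Case 2: the function returns an unexpected type
r_mal <- sapply(1:3, function(x) as.character(x))    # silently a character vector
err <- tryCatch(
  vapply(1:3, function(x) as.character(x), integer(1)),
  error = function(e) e
)

stopifnot(is.list(r_sapply), is.integer(r_vapply))
stopifnot(is.character(r_mal), inherits(err, "error"))
```

`map_int` from `purrr` makes a similar promise: it returns an integer vector or fails.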
