
I am often faced with data that have too many levels of a categorical variable to plot satisfactorily on one plot. When this situation arises, I write something to loop over that variable and save several plots, one per level.

This process is illustrated by the following example:

library(tidyr)
library(dplyr)
library(ggplot2)

mtcars <- add_rownames(mtcars, "car")

param<-unique(mtcars$cyl)
for (i in param)
{
mcplt <- mtcars %>% filter(cyl==i) %>% ggplot(aes(x=mpg, y=hp)) +
    geom_point() +
    facet_wrap(~car) +
    ggtitle(paste("Cylinder Type: ",i,sep=""))
  ggsave(mcplt, file=paste("Type",i,".jpeg",sep=""))
}

Whenever I see references to looping online, though, everyone seems to indicate that looping is usually not a good strategy in R. If that's the case, can anyone recommend a better way of achieving the same result as above? I'd be particularly interested in something faster, since loops are so slow. But maybe this is already the best solution; I was just curious whether anyone could improve on it.

Thanks in advance.

– boshek

    I think your loop is great as written. Loops get a lot of very unfair bad publicity in R. – bdemarest Dec 10 '15 at 00:11
  • Agree that loop is ok. The code loads tidyr but does not use it; loop indentation could be improved and `paste(..., sep = "")` is better written as `paste0(...)` or use `sprintf`. – G. Grothendieck Dec 10 '15 at 00:18
  • Loops are frequently discouraged in R because many of the common loop-style operations are built into base R as functions and should be used as such; because those don't exist in lower-level languages, programmers with a different language background may reach for loops even in simple situations like summing the elements of a vector/array. Of course this extends to much more complicated cases. But I agree that using a `for` loop to generate several (graphically) identical plots of different variables is a perfectly fine use, as I do it myself. – Molx Dec 10 '15 at 00:54
  • I do believe this question is more about why loops are (said to be) a bad idea in R, rather than the specific case about plots though. – Molx Dec 10 '15 at 01:02
    I think a loop works fine in this case. You could use `lapply` also. It seems to me that a more common case of loops (rightly) getting a bad rap is the use of a loop instead of vectorization. – eipi10 Dec 10 '15 at 04:47
  • Using loops is not wrong per se, but they can be very inefficient and lead to bulky/unreadable code. In this case, it seems quite efficient and allows you to subset and create filenames. – Heroka Dec 10 '15 at 08:49

1 Answer


This is a well-trodden topic for R; see SO posts here and here. Answers to those questions highlight that *apply() alternatives to for() improve clarity, make parallelization easier, and in some circumstances speed things up (a minimal lapply() sketch of your loop follows the list below). However, presumably your real question is "how do I do this faster?", because it is taking long enough that you're unhappy. Inside your loop you are doing three distinct tasks:

  1. Break out a chunk of the dataframe using filter().
  2. Make a plot.
  3. Save the plot to a jpeg.
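
For reference, here is a minimal lapply() sketch of your original loop, doing the same three steps per cyl level (it assumes the libraries you loaded above; lapply() mainly buys clarity and an easy path to parallelization, not raw speed):

# Same three tasks, driven by lapply() instead of for().
lapply(unique(mtcars$cyl), function(i) {
  p <- mtcars %>% filter(cyl == i) %>%   # 1. break out a chunk
    ggplot(aes(x = mpg, y = hp)) +       # 2. make a plot
    geom_point() +
    facet_wrap(~car) +
    ggtitle(paste0("Cylinder Type: ", i))
  ggsave(paste0("Type", i, ".jpeg"), p)  # 3. save it to a jpeg
})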

There are multiple ways to do all three of these steps, so let's try to evaluate them. I'll use the diamonds data from ggplot2 because it is bigger than the cars data; I hope differences in performance between methods will be noticeable that way. I learned a lot from this chapter of Hadley Wickham's book on measuring performance.

So that I can use profiling, I put the following block of code inside a function and save it in a separate R file named for_solution.r.

f <- function(){
  param <- unique(diamonds$cut)
  for (i in param){
    mcplt <- diamonds %>% filter(cut==i) %>% ggplot(aes(x=carat, y=price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",i,sep=""))
    ggsave(mcplt, file=paste("Cut",i,".jpeg",sep=""))
  }
}

and then I do:

library(dplyr)
library(ggplot2)
source("for_solution.r",keep.source=TRUE)
Rprof(line=TRUE)
f()
Rprof(NULL)
summaryRprof(lines="show")

Examining that output, I see that the block of code spends 97.25% of its time just saving the files. Examining the source of ggsave(), I can see that the function does a lot of defensive programming to identify the type of output, then opens the graphics device, prints, and closes the device. So I wonder whether doing just that step manually would help. I'm also going to take advantage of the fact that a jpeg device automatically produces a new file for each page, so I can open and close the device only once.

f1 <- function(){
  param <- unique(diamonds$cut)
  jpeg("cut%03d.jpg",width=par("din")[1],height=par("din")[2],units="in",res=300) # open the jpeg device, change defaults to match ggsave()
  for (i in param){
    mcplt <- diamonds %>% filter(cut==i) %>% ggplot(aes(x=carat, y=price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",i,sep=""))
    print(mcplt)
  }
  dev.off()
}

and now profiling again

Rprof(line=TRUE)
f1()
Rprof(NULL)
summaryRprof(lines="show")

f1() still spends most of its time on print(mcplt), and it is slightly faster than before (1.96 seconds compared to 2.18 seconds). One possible way to speed things up is to use a smaller device (lower resolution or a smaller image); when I used the defaults for jpeg() the difference was larger, more like 25% faster. I also tried changing the device to png(), but that made no difference.
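
For concreteness, that lower-resolution variant is just f1() with the device opened at the jpeg() defaults (480 x 480 pixels at 72 ppi); the function name f2 is mine:

f2 <- function(){
  param <- unique(diamonds$cut)
  jpeg("cut%03d.jpg")  # jpeg() defaults: smaller, lower-resolution files
  for (i in param){
    mcplt <- diamonds %>% filter(cut==i) %>% ggplot(aes(x=carat, y=price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",i,sep=""))
    print(mcplt)
  }
  dev.off()
}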

Based on the profiling I don't expect this to help, but for completeness I'm going to try doing away with the for loop and running everything inside dplyr with do(). I found this question and this one helpful.

jpeg("cut%03d.jpg",width=par("din")[1],height=par("din")[2],units="in",res=300) # open the jpeg device, change defaults to match ggsave()
plots = diamonds %>% group_by(cut) %>% 
  do({plot=ggplot(aes(x=carat, y=price),data=.) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",.$cut,sep="")) 
    print(plot)})

dev.off()

Running that code gives

Error: Results are not data frames at positions: 1, 2, 3

but it seems to work. I believe the error arises when do() returns, because the print() method doesn't return a data.frame. Profiling seems to indicate it runs a bit faster, 1.78 seconds overall. But I don't like solutions that generate errors, even when they aren't causing problems.
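
One way to avoid the error (a sketch I haven't profiled) is to have the do() block return an empty data frame after printing, since do() insists on data frame results:

jpeg("cut%03d.jpg",width=par("din")[1],height=par("din")[2],units="in",res=300)
diamonds %>% group_by(cut) %>%
  do({
    plot <- ggplot(aes(x=carat, y=price), data=.) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",.$cut[1],sep=""))  # [1]: one title, not a vector
    print(plot)
    data.frame()  # do() requires a data frame, so return an empty one
  })
dev.off()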

I have to stop here, but I've already learned a great deal about where to focus attention. Other things to try would include:

  1. Using parallel or something similar to run each chunk of the dataframe in a separate process (sketched below). I'm not sure that would help if the problem is saving the file, but if rendering the image is CPU-bound it would, I think.
  2. Trying data.table instead of dplyr; but again, it's the printing part that's slow.
  3. Trying base graphics, lattice, or plotly instead of ggplot2. I've no idea about the relative speed, but it could vary.
  4. Buying a faster hard drive! I just compared the speed of f() on my home computer, which has a regular hard drive, to the timings above from my work machine with an SSD: the home machine is about 3x slower.
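
A sketch of idea 1 (untested against the timings above; mclapply() forks, so it works on Unix-alikes only, and on Windows you'd use parLapply() with a cluster instead; the function name, the fixed 7 x 7 inch size, and mc.cores = 2 are my choices):

library(parallel)

# Each worker filters its chunk, renders, and writes its own file,
# so the graphics devices don't collide across processes.
save_cut <- function(i) {
  p <- diamonds %>% filter(cut == i) %>% ggplot(aes(x = carat, y = price)) +
    geom_point() +
    facet_wrap(~color) +
    ggtitle(paste0("Cut: ", i))
  jpeg(paste0("Cut", i, ".jpeg"), width = 7, height = 7, units = "in", res = 300)
  print(p)
  dev.off()
}
mclapply(unique(diamonds$cut), save_cut, mc.cores = 2)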
– atiretoo