This is a well-covered topic for R; see the SO posts here and here. Answers to those questions highlight that `*apply()` alternatives to `for()` improve clarity, make parallelization easier, and in some circumstances speed things up. However, presumably your real question is "how do I do this faster?", because it is taking long enough that you're unhappy. Inside your loop you are doing 3 distinct tasks:
- Break out a chunk of the dataframe using `filter()`.
- Make a plot.
- Save the plot to a jpeg.
There are multiple ways to do each of these three steps, so let's try to evaluate them all. I'll use the diamonds data from ggplot2 because it is bigger than the cars data; I hope that will make differences in performance between methods noticeable. I learned a lot from this chapter of Hadley Wickham's book on measuring performance.
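As an aside, the `*apply()` flavour mentioned at the top would look something like this (a minimal sketch, assuming dplyr and ggplot2 are loaded; it performs the same three steps and won't be faster by itself, just driven by `lapply()` instead of `for()`):

```r
# A sketch of the *apply() alternative: the same three steps per cut,
# driven by lapply() instead of for(). Assumes dplyr and ggplot2 are loaded.
lapply(as.character(unique(diamonds$cut)), function(i) {
  mcplt <- diamonds %>% filter(cut == i) %>%
    ggplot(aes(x = carat, y = price)) +
    geom_point() +
    facet_wrap(~color) +
    ggtitle(paste("Cut: ", i, sep = ""))
  ggsave(mcplt, file = paste("Cut", i, ".jpeg", sep = ""))
})
```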
So that I can use profiling, I put the following block of code inside a function and save it in a separate R file named `for_solution.r`.
```r
f <- function() {
  param <- unique(diamonds$cut)
  # for each cut: filter the rows, build the plot, save it as a jpeg
  for (i in param) {
    mcplt <- diamonds %>% filter(cut == i) %>%
      ggplot(aes(x = carat, y = price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ", i, sep = ""))
    ggsave(mcplt, file = paste("Cut", i, ".jpeg", sep = ""))
  }
}
```
and then I do:
```r
library(dplyr)
library(ggplot2)
source("for_solution.r", keep.source = TRUE)
Rprof(line.profiling = TRUE)
f()
Rprof(NULL)
summaryRprof(lines = "show")
```
Examining that output, I see that the block of code spends 97.25% of its time just saving the files. Looking at the source for `ggsave()`, I can see that the function does a lot of defensive programming to identify the type of output, then opens the graphics device, prints, and closes the device. So I wonder if doing just that step manually would help. I'm also going to take advantage of the fact that a jpeg device automatically produces a new file for each page, so I only have to open and close the device once.
```r
f1 <- function() {
  param <- unique(diamonds$cut)
  # open the jpeg device once, changing defaults to match ggsave();
  # the %03d in the filename gives each page its own file
  jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
       units = "in", res = 300)
  for (i in param) {
    mcplt <- diamonds %>% filter(cut == i) %>%
      ggplot(aes(x = carat, y = price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ", i, sep = ""))
    print(mcplt)
  }
  dev.off()
}
```
and now profiling again:

```r
Rprof(line.profiling = TRUE)
f1()
Rprof(NULL)
summaryRprof(lines = "show")
```
`f1()` still spends most of its time on `print(mcplt)`, but it is slightly faster than before (1.96 seconds compared to 2.18 seconds). One possible way to speed things up is to use a smaller device (lower resolution or a smaller image); when I used the defaults for `jpeg()` the difference was larger, more like 25% faster. I also tried changing the device to `png()`, but that made no difference.
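For illustration, a lower-resolution version of the device call (just a sketch; the 6x6 inches at 72 dpi are arbitrary values, so pick what your output actually needs):

```r
# Same one-device trick at a lower resolution: smaller images compress
# and write faster. The dimensions and resolution here are illustrative.
jpeg("cut%03d.jpg", width = 6, height = 6, units = "in", res = 72)
```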
Based on the profiling, I don't expect this to help, but for completeness I'm going to try doing away with the for loop and running everything inside dplyr with `do()`. I found this question and this one helpful.
jpeg("cut%03d.jpg",width=par("din")[1],height=par("din")[2],units="in",res=300) # open the jpeg device, change defaults to match ggsave()
plots = diamonds %>% group_by(cut) %>%
do({plot=ggplot(aes(x=carat, y=price),data=.) +
geom_point() +
facet_wrap(~color) +
ggtitle(paste("Cut: ",.$cut,sep=""))
print(plot)})
dev.off()
Running that code gives

```
Error: Results are not data frames at positions: 1, 2, 3
```

but it seems to work. I believe the error arises because `do()` expects each group's result to be a data frame, and `print()` doesn't return one. Profiling seems to indicate it runs a bit faster, 1.78 seconds overall. But I don't like solutions that generate errors, even when they are harmless.
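If I wanted to silence it, one option (a sketch, not something I've profiled) is to have the `do()` block return an empty data frame after printing, so `do()` gets what it expects; I've also swapped in `unique(.$cut)` so the title is built from a single value rather than the whole column:

```r
# Returning a data frame from the do() block avoids the
# "Results are not data frames" error; data.frame() is an empty one.
jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
     units = "in", res = 300)
diamonds %>% group_by(cut) %>%
  do({
    plot <- ggplot(aes(x = carat, y = price), data = .) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ", unique(.$cut), sep = ""))
    print(plot)
    data.frame()  # empty data frame keeps do() happy
  })
dev.off()
```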
I have to stop here, but I've already learned a great deal about where to focus attention. Other things to try would include:
- Using `parallel` or something similar to run each chunk of the dataframe in a separate process (see the first sketch after this list). I'm not sure that would help if the bottleneck is saving the file, but if rendering the image is CPU-bound it would, I think.
- Try data.table instead of dplyr (see the second sketch below), but again, it's the printing part that's slow.
- Try base graphics, lattice, or plotly instead of ggplot2. I have no idea about their relative speed, but it could vary.
- Buy a faster hard drive! I just compared the speed of `f()` on my home computer, which has a regular hard drive, to my work machine with an SSD: the spinning disk is about 3x slower than the timings above.
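Here is a rough sketch of the parallel idea. `save_cut()` is a hypothetical helper name, the core count is illustrative, and `mclapply()` forks, so this assumes a Unix-alike (on Windows you'd need `parLapply()` and a cluster instead):

```r
# A sketch of the parallel idea: each cut is filtered, plotted, and saved
# in its own forked process via mclapply() (Unix-alike only).
library(parallel)
library(dplyr)
library(ggplot2)

save_cut <- function(i) {
  mcplt <- diamonds %>% filter(cut == i) %>%
    ggplot(aes(x = carat, y = price)) +
    geom_point() +
    facet_wrap(~color) +
    ggtitle(paste("Cut: ", i, sep = ""))
  ggsave(mcplt, file = paste("Cut", i, ".jpeg", sep = ""))
}

mclapply(as.character(unique(diamonds$cut)), save_cut, mc.cores = 2)
```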
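And a sketch of the data.table version: only the subsetting step changes (`dt[cut == i]` replaces `filter()`), so, as noted, the slow printing is untouched:

```r
# A sketch of the data.table flavour: dt[cut == i] replaces filter().
# The plotting and printing are unchanged, so don't expect a large win.
library(data.table)
library(ggplot2)

dt <- as.data.table(diamonds)
jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
     units = "in", res = 300)
for (i in unique(as.character(dt$cut))) {
  mcplt <- ggplot(dt[cut == i], aes(x = carat, y = price)) +
    geom_point() +
    facet_wrap(~color) +
    ggtitle(paste("Cut: ", i, sep = ""))
  print(mcplt)
}
dev.off()
```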