
How do I make multiple plots of the same data, each colored by a different factor (column), while recycling the data? Is this what gridExtra does differently from cowplot?

Objective: I want to compare different clusterings of the same data visually and efficiently. I currently believe the easiest way to compare 2-4 clustering algorithms visually is to plot their results next to each other.

Thus, how do I plot the same data side by side colored differently?

Challenge/Specifications: Performance is very important. I have roughly 30,000 graphs to make, each with 450-480 points. It is critical that the data is "recycled," i.e. shared between plots rather than copied.
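
A minimal illustration of what I mean by "recycled" (R's copy-on-modify sharing; sizes are approximate):

    library(pryr)
    x <- rnorm(1e6)    # ~8 MB vector
    y <- x             # no copy made: y shares x's underlying data
    object_size(x)     # ~8 MB
    object_size(x, y)  # still ~8 MB: the shared vector is counted once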

I am able to plot them side by side using the packages cowplot and gridExtra. I just started using gridExtra today, but it seems to recycle data and suits my purposes better than cowplot. Update: eipi10 demonstrated below that facet_wrap works if I gather the columns before plotting.

Set up

    #Packages
     library(ggplot2)
     library(cowplot)
     library(gridExtra)
     library(pryr) #memory profile

    #Data creation
      x.points  <- c(1, 1, 1, 3, 3, 3, 5, 5, 5)
      y.points  <- c(1, 3, 5, 1, 3, 5, 1, 3, 5)
      cl_vert   <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
      cl_hoz    <- c("A", "B", "C", "A", "B", "C", "A", "B", "C")
      cl_cent   <- c("A", "A", "A", "A", "B", "A", "A", "A", "A")
    df <- data.frame(x.points, y.points, cl_vert, cl_hoz, cl_cent)

Graphing them

    #Graph function and individual plots
    #color.by is a bare vector, not a column of `data`; aes() falls back
    #to the function environment to find it when the plot is built
     graph <- function(data = df, Title = "", color.by, legend.position = "none"){
       ggplot(data, aes(x = x.points, y = y.points)) +
         geom_point(aes(color = as.factor(color.by))) +
         scale_color_brewer(palette = "Set1") +
         labs(subtitle = Title, x = "log(X)", y = "log(Y)", color = "Color") +
         theme_bw() + theme(legend.position = legend.position)
     }

     g1 <- graph(Title = "Vertical", color.by = cl_vert)
     g2 <- graph(Title = "Horizontal", color.by = cl_hoz)
     g3 <- graph(Title = "Center", color.by = cl_cent)

    #Cowplot
     legend <- get_legend(graph(color.by = cl_vert, legend.position = "right")) #extracts only the legend grob; not a memory waste
     plot <- plot_grid(g1, g2, g3, labels = c("A", "B", "C"))
     title <- ggdraw() + draw_label(paste0("Data Ex ", "1"), fontface = 'bold') 
     plot2 <- plot_grid(title, plot, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
     plot3 <- plot_grid(plot2, legend, rel_widths = c(1, 0.3))
     plot3

    #gridExtra
     plot_grid.ex <- grid.arrange(g1, g2, g3, ncol = 2, top = paste0("Data Ex ", "1"))
     plot_grid.ex
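
One detail worth noting: grid.arrange() draws the layout immediately as a side effect. When looping over many data sets, arrangeGrob() builds the same gtable without drawing it, so rendering can be deferred:

    #arrangeGrob() = grid.arrange() without the immediate draw
     library(grid)
     tbl <- arrangeGrob(g1, g2, g3, ncol = 2, top = paste0("Data Ex ", "1"))
     grid.draw(tbl) #render on demand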

Memory usage with pryr

    #Comparison
     object_size(plot_grid.ex) #315 kB 
     object_size(plot3) #1.45 MB
    #Individual objects
     object_size(g1) #756 kB
     object_size(g2) #756 kB
     object_size(g3) #756 kB
     object_size(g1, g2, g3) #888 kB
     object_size(legend) #43.6 kB

Additional Questions: After writing this question and preparing the sample data, I remembered gridExtra, tried it, and its output seems to take up less memory than the combined size of its component graphs. I thought g1, g2, and g3 shared the same data except for the coloring assignment, which is why their combined size (888 kB) is only roughly 130 kB more than a single graph (756 kB). How is it that plot_grid.ex takes up even less space than that? ls.str(plot_grid.ex) doesn't seem to show any consolidation of g1, g2, and g3. Would my best bet be to use lineprof() and run line-by-line comparisons?
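
For reference, one way to inspect the sharing directly (ggplot objects store their data and captured environment as list elements):

     identical(g1$data, g2$data) #TRUE: all three plots hold the same df
     ls.str(envir = g1$plot_env) #the environment captured inside graph();
                                 #only color.by differs between g1, g2, g3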


Please bear with me as I am a new programmer (just truly started scripting December); I don't understand all the technical details yet but I want to.

A Duv
  • How are you going to look at 30,000 plots? Maybe there is a better way to do what you want to do? – Ian Wesley Apr 19 '18 at 19:32
  • I'm not going to do all 30,000 sets of data at once. I'm also not going to save the graphs themselves, only the data used to make them. I actually have the data in identical format spread across ~500 folders (categorized by the # of clusters and p_value given from HDBSCAN). I load in one csv file at a time; that csv can have anywhere from 2 to 200 sets of data. I use tidyverse `group_by()` and `nest()` to form a nested list, then run lapply to make all the graphs (sketched below). I then call one graph at a time (I have a shiny app to both navigate the directories and display the proper graph) – A Duv Apr 19 '18 at 20:56
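
A rough sketch of that per-file workflow (the file path and column names are hypothetical):

    #one csv -> nested list -> one ggplot per data set
     library(tidyverse)
     csv    <- read_csv("cl4_p05/results.csv") #hypothetical path
     nested <- csv %>% group_by(data.set.id) %>% nest() #one row per data set
     plots  <- lapply(nested$data, function(d) graph(data = d, color.by = d$cl_vert))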

2 Answers


Faceting will work here if you convert your data to long format. Here's an example:

    library(tidyverse)

    df %>% gather(method, cluster, cl_vert:cl_cent) %>% 
      ggplot(aes(x = x.points, y = y.points)) + 
        geom_point(aes(color = cluster)) + 
        scale_color_brewer(palette = "Set1") + 
        theme_bw() +
        facet_wrap(~ method)

[Resulting plot: the points faceted into three panels by clustering method, colored by cluster]
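
Note: `gather()` has since been superseded by `pivot_longer()` in tidyr 1.0; an equivalent call:

    df %>% 
      pivot_longer(cl_vert:cl_cent, names_to = "method", values_to = "cluster") %>% 
      ggplot(aes(x = x.points, y = y.points)) + 
        geom_point(aes(color = cluster)) + 
        scale_color_brewer(palette = "Set1") + 
        theme_bw() +
        facet_wrap(~ method)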

eipi10
  • Actually, I didn't think of converting the data to long format! This is a potential fix. – A Duv Apr 19 '18 at 20:14
  • `ggplot2` faceting and aesthetics are designed to work most effectively with long-format data. – eipi10 Apr 19 '18 at 20:23
  • This is a lot faster than cowplot. Thank you so much! I compared the gridExtra approach to the facet method above, and faceting was 54.593x faster. – A Duv Apr 19 '18 at 22:10
  • 1
    `gridExtra::grid.arrange` and `cowplot::plot_grid` are mainly for laying out *separate* plots together on a single "canvas", rather than a substitute for faceting (when faceting makes sense). – eipi10 Apr 19 '18 at 22:17

If you're after a boost in performance, don't use any of those packages, including ggplot2. gridExtra, cowplot, and the like will always make things slower, and they do not "recycle" data in any sense (it's not clear what you mean by this).

I would recommend doing all the time-consuming data processing outside ggplot2 and passing in results that are already much closer to the final mapping (i.e. colour groups already assigned, etc.). You may find that ggplot2 then becomes overkill and slow for your application (lattice is typically faster, and so is base plot).
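
For example, a minimal base-graphics sketch of that idea (palette values are illustrative, chosen to mimic Set1; all colour assignment happens before any plotting call):

    #map clusters to colours once, outside the plotting calls,
    #then draw three cheap base-graphics panels
    pal  <- c(A = "#E41A1C", B = "#377EB8", C = "#4DAF4A")
    cols <- lapply(df[c("cl_vert", "cl_hoz", "cl_cent")],
                   function(cl) pal[as.character(cl)])
    op <- par(mfrow = c(1, 3))
    for (nm in names(cols)) {
      plot(df$x.points, df$y.points, col = cols[[nm]], pch = 19,
           main = nm, xlab = "log(X)", ylab = "log(Y)")
    }
    par(op)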

If you actually want shared data, I would think something like d3.js might get you closest to this goal, although it would simply leave the data duplication for the browser to do at rendering time. Once the data points are rendered on screen, they have to be independent and duplicated, so the question is where in the pipeline is this most convenient and efficient to do.

  • I'm interested in a boost in performance. I will look into lattice and base plot. How would I assign the groups of colours before running the plot? One of my concerns is that I actually customized the shape/size of some of the points by an additional column (based upon a conditional outside the graph) so I'm worried about translating all of that back into lattice/base. I thought recycled data was when R references the same object again to save memory. ex: x <- 1e6; y <- x; pryr::object_size(x); pryr::object_size(x, y) – A Duv Apr 19 '18 at 20:36
  • Since you mention that you're relatively new to programming, I would strongly advise that you don't worry too much about object sizes and performance (to a point). R is very useful for interactive analysis, and when the problem at hand is somewhat larger in scale than usual, your first strategy should be a brute-force increase in computer resources: how long would it take to just use ggplot2 to produce those plots? Overnight? A weekend? Rent an HPC machine? Whatever gets the job done will be better than trying to bypass the entire ecosystem you're learning. – user9671660 Apr 19 '18 at 21:57
  • Thank you for your insight; I do think an increase in computer resources would fix everything. The answer by eipi10 sped things up to an almost acceptable speed on my work computer. I know R is good for interactive analysis, but I need to speed up some analysis. I used to have access to grid computing at work but the servers crashed and no longer support the tools I need. Due to a complicated work situation, I can't upgrade my hardware. If I can't improve these results, then my supervisor wants me to manually cluster these 30k graphs using an inefficient program... – A Duv Apr 19 '18 at 22:20