
I'm working with a >1 GB data set and running into out-of-memory ("Cannot allocate...") errors when graphing with ggplot2. While researching where all my memory is going (with the help of sources like this and this and this), I've discovered that the following code with dummy data causes significant memory usage that is never reclaimed in Windows Task Manager, even after repeated calls to gc().

print(begMemSize <- memory.size())

library(ggplot2)
numRows <- 1e6
df <- data.frame( x1 = runif(numRows), x2 = runif(numRows), xGroup = factor(trunc(runif(numRows, 1, 6))) )
df$y <- df$x1 + df$x2

gc()
print(mid1MemSize <- memory.size())

# This is fine
ggplot( data = df, mapping = aes( x = x1)) +
  geom_smooth( mapping = aes( y = y))

gc()
print(mid2MemSize <- memory.size())

# This makes memory.size() explode
ggplot( data = df, mapping = aes( x = x1)) +
  geom_smooth( mapping = aes( y = y)) +
  geom_hline( mapping = aes( yintercept = 0.25))

gc()
print(endMemSize <- memory.size())

The expression `c(begMemSize, mid1MemSize, mid2MemSize, endMemSize)` returns:

[1]   50.62  102.30  199.22 1208.39

Note the huge jump in the last number. That last number matches the readings in Windows Task Manager (very close to "Memory (active working set)" and only slightly lower than "Commit size" in the Details tab). Sometimes, with repeated calls to gc(), I can get memory.size() to go down in R, but not the readings in Windows Task Manager. I worry that my out-of-memory errors are related to this, but my immediate questions are:

  1. Why is this happening?
  2. Is there any way to get the Windows Task Manager memory readings to go down in this situation (without, obviously, closing R and losing all the data processing in memory)?

sessionInfo() output (using RStudio 1.3.1056):

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.2

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11  magrittr_1.5     splines_4.0.2    tidyselect_1.1.0 munsell_0.5.0    colorspace_1.4-1 lattice_0.20-41  R6_2.4.1         rlang_0.4.6      dplyr_1.0.0      tools_4.0.2      grid_4.0.2      
[13] gtable_0.3.0     nlme_3.1-148     mgcv_1.8-31      withr_2.2.0      ellipsis_0.3.1   digest_0.6.25    tibble_3.0.1     lifecycle_0.2.0  crayon_1.3.4     Matrix_1.2-18    farver_2.0.3     purrr_0.3.4     
[25] vctrs_0.3.1      glue_1.4.1       labeling_0.3     compiler_4.0.2   pillar_1.4.4     generics_0.0.2   scales_1.1.1     pkgconfig_2.0.3 
Joel Buursma
  • Another odd thing about this example is that the value returned by `endMemSize` is much higher than the output of the final call to `gc()`. – Joel Buursma Aug 03 '20 at 20:17
  • Can you post the output of `sessionInfo()` too? – Tung Aug 03 '20 at 20:24
  • Your code works on my PC: Windows 10, 16GB RAM, R 4.0.2 64-bit & ggplot 3.3.2 – Tung Aug 03 '20 at 20:30
  • Relevant: https://github.com/tidyverse/ggplot2/issues/3249 & https://github.com/tidyverse/ggplot2/issues/3008 & https://github.com/tidyverse/ggplot2/issues/3997 – Tung Aug 03 '20 at 20:30
  • Thanks for your response. I added the `sessionInfo()` output. Looks like I'm using the same version of R & ggplot2 as you are. When you say that my code "works" on your PC, are you saying that you don't see the big spike in memory usage at the end that you can't get down? – Joel Buursma Aug 03 '20 at 21:00
  • I saw a big spike of memory usage from about 200 MB to 1.4 GB but the PC didn't run out of memory (the plot was created) – Tung Aug 04 '20 at 01:06
  • Right. If you want to actually run out of memory, just increase numRows to something like 1e8. Then, on my machine, the first graph completes and the second says, "Error: memory exhausted (limit reached?)" and "Computation failed in `stat_smooth()`: cannot allocate vector of size 7.5 Gb". But why does merely adding a horizontal line to the graph (geom_hline) cause this behavior? Why is it trying to allocate 7.5 GB when df is only 2.6 GB? How do I make Task Manager's Memory reading go down afterwards? Enlightenment on any of these questions would be most welcome. – Joel Buursma Aug 04 '20 at 13:47
  • These do not make Task Manager's Memory reading go down: `gc()`, `rm( list = ls())`, "Clear all Plots" (RStudio), "Clear all History entries", "Clear all objects from workspace". Restarting R does: `.rs.restartR()`. – Joel Buursma Aug 04 '20 at 16:02
  • I discovered that this variant of the problematic command is fine: `ggplot() + geom_smooth( data = df, mapping = aes( x = x1, y = y)) + geom_hline( mapping = aes( yintercept = 1))`. Task Manager goes up to a peak of 1.9 GB, but a call to `gc()` afterwards brings it down. Note that the initial `ggplot()` function has no variables. – Joel Buursma Aug 10 '20 at 14:13
  • And this simpler variant is bad: `ggplot() + geom_hline( data = df, mapping = aes( yintercept = 1))`. Task Manager spikes to almost 1.4 GB, but then won't come down no matter how many `gc()` or `rm( list = ls())` calls I do. So it's not geom_smooth's fault at all -- just geom_hline's! – Joel Buursma Aug 10 '20 at 14:22

1 Answer


This is only a partial answer, addressing one aspect of the problem.

When you put yintercept inside the aes() function, you instruct ggplot2 to map the yintercept aesthetic onto every row of the data argument. The geom_hline() layer therefore expands its data into a large data.frame with one row per row of df. If you instead pass yintercept as a normal argument to the layer, outside aes(), the layer data stays small. See the example below.

library(ggplot2)
numRows <- 1e6
df <- data.frame( x1 = runif(numRows), x2 = runif(numRows), xGroup = factor(trunc(runif(numRows, 1, 6))) )
df$y <- df$x1 + df$x2

p <- ggplot( data = df, mapping = aes( x = x1)) +
  geom_smooth( mapping = aes( y = y))

p_mapped <- p + geom_hline(mapping = aes(yintercept = 0.25))
p_unmapped <- p + geom_hline(yintercept = 0.25)

layer_mapped <- layer_data(p_mapped, 2)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
layer_unmapped <- layer_data(p_unmapped, 2)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

(format(object.size(layer_mapped), units = "Mb"))
#> [1] "42 Mb"
(format(object.size(layer_unmapped), units = "Kb"))
#> [1] "2.2 Kb"

Note that while the size of the layer data by itself does not account for most of the memory used, keep in mind that many computations are run on that data along the way, from calculating axis limits to applying alpha to colours.
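
If you want a rough sense of how much data the whole build pipeline carries around, one way (a sketch reusing p_mapped and p_unmapped from above) is to measure the built plot objects directly:

built_mapped   <- ggplot_build(p_mapped)    # runs the stat/position computations
built_unmapped <- ggplot_build(p_unmapped)

format(object.size(built_mapped), units = "Mb")
format(object.size(built_unmapped), units = "Mb")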

Furthermore, when I ran your example with yintercept moved from the mapping argument to a regular layer argument, endMemSize came out at around 200 Mb for me.

Lastly, ggplot2 keeps a copy of the last plot in its namespace, which is not visible to users. You can use set_last_plot(NULL) to free up some extra memory.
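
For example, after you are done with a plot, something along these lines should drop that hidden reference so the garbage collector can reclaim it:

set_last_plot(NULL)  # clear ggplot2's internal reference to the last plot
invisible(gc())      # then trigger a garbage collection in R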

teunbrand
  • This is correct. In addition, the mapped version doesn't draw just one line, it draws a million lines on top of one another. You may want to amend your answer to clarify this. – Claus Wilke Aug 10 '20 at 16:30
  • I looked at `GeomHline$draw_panel()` and it seems to reduce the data by calling `unique()`. The gtable objects derived from `p_mapped` and `p_unmapped` are very similar in size. – teunbrand Aug 10 '20 at 16:32
  • Ah, maybe this has changed recently. In the past we had a lot of issues with multiply drawn lines. – Claus Wilke Aug 10 '20 at 16:34
  • Thank you for the explanation! It looks like the data parameter can determine the x axis labels for `geom_hline`, but that is apparently unnecessary in my reprex because I could just pass `df` only to `geom_smooth`. So that's a good workaround. Still, it's odd to me that the reprex leaves Task Manager's memory usage > 1GB even after a `set_last_plot(NULL)` call. Related GitHub issue: https://github.com/tidyverse/ggplot2/issues/4167. – Joel Buursma Aug 10 '20 at 17:35
  • I'll agree that it is odd. I also get R reporting smaller memory usage than the task manager reports and I don't know what is causing this. – teunbrand Aug 10 '20 at 17:56