I'm trying to run a function that multiplies a variable in my dataset by a scalar taken from a separate weight dataset. Which scalar is used depends on each record's year and location, which correspond to the rows and columns of the weight dataset. R, however, cannot return the result due to its size, which surprised me since the reported allocation is only 15.6 Mb; there are >2 million records. When running it with simulated example variables, a much larger size is reported (roughly 30 Tb).
new.value.fn <- function(variable, year, location, weight) {
  row <- year - 2000                             # year 2001 -> row 1, 2002 -> row 2, ...
  new.value <- variable * weight[row, location]
  return(new.value)
}
# Simulated stand-ins for the real data
variable <- rnorm(2000000, 900, 750)
variable <- ifelse(variable < 0, 0, variable)    # no negative values
year     <- runif(2000000, min = 2001, max = 2015)
location <- runif(2000000, min = 1, max = 7)
weight   <- matrix(runif(14 * 7, min = 1, max = 1.3), ncol = 7)

gc()
new.value.fn(variable, year, location, weight)                          # simulated example data
# Error: cannot allocate vector of size 29802.3 Gb

gc()
new.value.fn(actual.var, actual.year, actual.location, actual.weight)   # actual data
# Error: cannot allocate vector of size 15.6 Mb
Running gc() beforehand, as per the answers to this question, does not change this. What is more surprising is that memory.size() reports nearly 28 GB, yet R cannot allocate 15.6 Mb, which is approximately the size of the original vector:
> memory.size()
[1] 28691.74
> object.size(variable)
16390984 bytes
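For reference, 16390984 bytes is roughly 15.6 Mb, so the failed allocation is about one copy of the vector:

> format(object.size(variable), units = "Mb")
[1] "15.6 Mb"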
My question is: why can't R allocate a vector that is much smaller than the memory actually available? It may be related to the fact that the actual function also requires far more memory than the error message suggests.
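A quick check on a small subset of the simulated data above seems to point at the subscripting itself (the names n and sub below are just for this check):

n   <- 1000
sub <- weight[(year[1:n] - 2000), location[1:n]]
dim(sub)            # 1000 x 1000 -- an n x n matrix rather than one value per record
object.size(sub)    # ~8 Mb already, for only 1,000 records
# Scaled up to 2 million records: 2e6 * 2e6 * 8 bytes / 1024^3 is about 29802.3,
# which matches the "Gb" figure in the error message above.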
This computer has 32 GB of RAM (31.9 GB usable max). Further information about my computer and session:
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tictoc_1.0 stringr_1.4.0 readxl_1.3.1 readr_1.4.0 questionr_0.7.4 lubridate_1.7.10
[7] HeatStress_1.0.7 magrittr_2.0.1 forecast_8.14 dplyr_1.0.5 data.table_1.14.0 arsenal_3.6.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41 zoo_1.8-9 assertthat_0.2.1 digest_0.6.27 lmtest_0.9-38
[7] psych_2.0.12 utf8_1.2.1 mime_0.10 cellranger_1.1.0 R6_2.5.0 labelled_2.8.0
[13] ggplot2_3.3.3 highr_0.8 pillar_1.5.1 rlang_0.4.10 curl_4.3 rstudioapi_0.13
[19] miniUI_0.1.1.1 fracdiff_1.5-1 TTR_0.24.2 munsell_0.5.0 tinytex_0.30 shiny_1.6.0
[25] compiler_4.0.4 httpuv_1.5.5 xfun_0.21 pkgconfig_2.0.3 mnormt_2.0.2 tmvnsim_1.0-2
[31] urca_1.3-0 htmltools_0.5.1.1 nnet_7.3-15 tidyselect_1.1.0 tibble_3.1.0 quadprog_1.5-8
[37] fansi_0.4.2 crayon_1.4.1 later_1.1.0.1 grid_4.0.4 nlme_3.1-152 xtable_1.8-4
[43] gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1 scales_1.1.1 quantmod_0.4.18 cli_2.3.1
[49] stringi_1.5.3 promises_1.2.0.1 tseries_0.10-48 timeDate_3043.102 ellipsis_0.3.1 xts_0.12.1
[55] generics_0.1.0 vctrs_0.3.6 forcats_0.5.1 tools_4.0.4 glue_1.4.2 purrr_0.3.4
[61] hms_1.0.0 parallel_4.0.4 fastmap_1.1.0 colorspace_2.0-0 haven_2.3.1
When attempting the call with either the reproducible example or the actual data, my computer's memory usage skyrockets from about 14 GB to the maximum, which is likely related to the issue.
EDIT: The example weight is a matrix, but actual.weight is a data frame. Changing the class changes the size reported in the error message:
new.value.fn(variable, year, location, as.data.frame(weight))                      # example weights as a data frame
# Error: cannot allocate vector of size 15.3 Mb
new.value.fn(actual.var, actual.year, actual.location, as.matrix(actual.weight))   # actual weights as a matrix
# Error: cannot allocate vector of size 31275.5 Gb
With the actual weights coerced to a matrix, the reported vector size really does exceed the computer's capacity, which suggests that the 15.6 Mb figure greatly understated the memory the operation actually needs. Why the matrix vs. data frame class makes such a large difference in the reported size I don't know (and I still need to work out how to carry out the calculation).
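What I think I ultimately need is one weight per record rather than a full cross of every row against every column. Something along the lines of the sketch below is what I have in mind, but it is untested on the real data (the function name is just for illustration, and it assumes the weights can be coerced to a numeric matrix and that trunc(year - 2000) and trunc(location) give valid row and column numbers):

# Sketch only: element-wise lookup via a two-column index matrix
new.value.fn2 <- function(variable, year, location, weight) {
  weight <- as.matrix(weight)                          # handles the data-frame case too
  idx    <- cbind(trunc(year - 2000), trunc(location)) # one (row, column) pair per record
  variable * weight[idx]                               # one weight per record, not an n x n matrix
}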