I can't find any information about this and am not sure what other keywords I could google, so apologies if this is a duplicate.

I have some lists of data.tables in my workspace, as displayed here:

> lsos()   
                       Type     Size PrettySize Rows Columns
all_subsets            list 46673512    44.5 Mb    3      NA
glm_Macro.part_1       list 15817064    15.1 Mb    2      NA
glm_Macro.part_2       list 15817064    15.1 Mb    2      NA
glm_Macro.part_3       list 15289864    14.6 Mb    2      NA

I then need to save the last three objects in that listing to disk. I do this simply using save() and the .rda extension, e.g.

save(glm_Macro.part_1, file = "glm_Macro.part_1.rda")

Looking on the disk, however, the sizes of the three respective files are 270.7, 268.8 and 262.6 MB, roughly 18 times larger than the sizes reported in the workspace.

Is there a known reason for this?

My only hunch is the way data.table uses referencing, meaning data is not copied but only referenced from the original data set. See here for an example of how that works. So when I save the data to disk, maybe it forces a copy of every data.table, whereas references were enough within the R workspace.

Terminal, RStudio and ESS (Emacs) all show the same sizes in the workspace, so it does not seem to depend on the front end used.

n1k31t4
  • What's the `lsos()` function? Why is the Columns column NA? Why is the Type column `list` and not `data.table`? 2 and 3 rows seem very tiny table so to get 44Mb size are there a large number of columns? – Matt Dowle Jan 16 '16 at 01:38
  • Oh so they're lists of data.table. Please read [Support](https://github.com/Rdatatable/data.table/wiki/Support) and simplify your example i.e. make it minimal. In this case, take one of those data.table's, save it and compare the size and just report that, first. – Matt Dowle Jan 16 '16 at 01:43
  • @Matt Dowle - lsos() is a function from the multilevelPSA package, which is more helpful than simply using ls() to inspect variables in the workspace. – n1k31t4 Jan 16 '16 at 01:43
  • Ok I will look into this and get back to you. Out of interest, have you encountered anything like this before? – n1k31t4 Jan 16 '16 at 01:44
  • Haven't seen an 18 times expansion before. Please do provide single table example. In this case you don't need to make it fully reproducible (a small example might not exhibit the problem). Providing the column types using `str(DT)` would be ok for starters. – Matt Dowle Jan 16 '16 at 02:18

1 Answer

I don't think this is related to data.table but rather to base R, which causes the saved file of glm objects to be very large in some cases.

From the names of your models, I am going to guess that you are fitting your glm models inside a function call. The output of glm contains a formula created inside the function environment, and formulas capture the environment in which they are created, so the file you save will contain that function environment.
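
A minimal sketch of that capture, using base R only. The helper `make_formula()` is hypothetical and not part of the question. `object.size()` does not count an attached environment, whereas `serialize()` (which writes the same kind of serialized representation that `save()` stores, here uncompressed) does, so the two numbers can disagree by a wide margin:

```r
# Hypothetical helper: builds a formula inside a function that also holds a large vector.
make_formula <- function() {
  bloat <- rnorm(1e6)   # roughly 8 MB of doubles living in the function environment
  y ~ x                 # the formula remembers the environment it was created in
}

f <- make_formula()

# The large vector is still reachable through the formula's environment.
ls(environment(f))
# "bloat"

# object.size() ignores the attached environment ...
in_memory <- as.numeric(object.size(f))

# ... whereas serialize() writes that environment out in full.
serialized <- length(serialize(f, NULL))

c(in_memory = in_memory, serialized = serialized)
```

Comparing `object.size(x)` with `length(serialize(x, NULL))` for one of the saved objects is a quick way to check whether a captured environment is behind the difference.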

Compare:

library(multilevelPSA)

test_in_env <- function(){
  bloat <- rnorm(10000000)  # ~76 Mb vector that the model never uses

  clotting <- data.frame(
    u = c(5,10,15,20,30,40,60,80,100),
    lot1 = c(118,58,42,35,27,25,21,19,18),
    lot2 = c(69,35,26,21,18,16,13,12,12))
  glm(lot1 ~ log(u), data = clotting, family = Gamma)
}

test.glm <- test_in_env()
lsos()
#                 Type  Size PrettySize Rows Columns
# test.glm         glm 94936    92.7 Kb   30      NA
# test_in_env function 12008    11.7 Kb   NA      NA
# GCtorture    logical    48   48 bytes    1      NA


save(test.glm, file = "glm_env_local.Rda")
# 75 Mb file created

This captures the local environment, including the bloat vector, in the saved file. In the global case, without the function:

bloat <- rnorm(10000000)

clotting <- data.frame(
  u = c(5,10,15,20,30,40,60,80,100),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))
test.glm <- glm(lot1 ~ log(u), data = clotting, family = Gamma)

lsos()
# bloat                numeric 80000040    76.3 Mb 1e+07      NA
# test.glm                 glm    94936    92.7 Kb 3e+01      NA
# test_in_env         function    12008    11.7 Kb    NA      NA
# clotting          data.frame     1280     1.2 Kb 9e+00       3
# local.env.formula    formula      880  880 bytes 3e+00      NA
# GCtorture            logical       48   48 bytes 1e+00      NA

save(test.glm, file = "glm_env_global.Rda")
## 5 Kb file

Here the save won't include the enclosing environment, so the file size is in line with what lsos reports. It is possible to remove the references in the glm output, as well as other bloat in that object.
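
One way to drop those references, sketched under the assumption that nothing from the function environment is needed later (for example by `predict()` or `update()`): point the environments stored in the `formula`, `terms`, and model-frame components at the global environment. `fit_in_function()` is a hypothetical stand-in for `test_in_env()` with a smaller `bloat` vector, and the serialized length is used as a proxy for the uncompressed file size:

```r
# Hypothetical stand-in for test_in_env(), with a smaller bloat vector.
fit_in_function <- function() {
  bloat <- rnorm(1e6)  # about 8 MB that the model itself never uses
  clotting <- data.frame(
    u = c(5, 10, 15, 20, 30, 40, 60, 80, 100),
    lot1 = c(118, 58, 42, 35, 27, 25, 21, 19, 18))
  glm(lot1 ~ log(u), data = clotting, family = Gamma)
}

fit <- fit_in_function()
size_before <- length(serialize(fit, NULL))

# Re-point the stored references to the function environment at the global environment.
environment(fit$formula) <- globalenv()
environment(fit$terms) <- globalenv()
attr(attr(fit$model, "terms"), ".Environment") <- globalenv()

size_after <- length(serialize(fit, NULL))
c(before = size_before, after = size_after)

coef(fit)  # the fitted coefficients are unaffected
```

The global environment is used rather than an empty one so that functions in the formula, such as `log()`, can still be found if the formula is evaluated again.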

A related problem with environments can be found here. For an explanation of the enclosing function environment, see Hadley Wickham's description.