I have a function that (1) loads some large CSV files, (2) processes those datasets, and (3) puts them into a list and returns the list object. It looks something like this:

library(data.table)

load_data <- function(){

  # Load the data
  foo <- data.table(iris)  # really this:  foo <- fread("foo.csv")
  bar <- data.table(mtcars)  # really this:  bar <- fread("bar.csv")

  # Clean the data
  foo[, Foo := mean(Sepal.Length) + median(bar$carb)]
  # ... lots of code here

  # Put datasets into a list
  datasets <- list(foo = foo[], bar = bar[])

  # Return result
  return(datasets)
}

My concern is that, when I build the list object, I am doubling the required memory because I'm basically creating a duplicate copy of each dataset.

  1. Is my assumption correct?
  2. If my assumption is correct, is it possible to assign my objects to a list without duplicating them? One possible solution is to load these objects into a list from the get-go (e.g. datasets <- list(foo = fread("foo.csv"), bar = fread("bar.csv"))), but this is undesirable because the code becomes lengthy and messy when it constantly refers to datasets$foo and datasets$bar.
Ben
  • Your assumption is not correct. Generally in R, data is only copied when its contents are modified. Just putting the objects into a list should not copy them. – MrFlick Aug 03 '18 at 19:32
  • More details here: https://stackoverflow.com/questions/15759117/what-exactly-is-copy-on-modify-semantics-in-r-and-where-is-the-canonical-source – MrFlick Aug 03 '18 at 19:33

1 Answer

You might want to look into Hadley's resource on memory usage in R, but as a quick illustration:

library(pryr)
mem_used()
#> 36.1 MB
foo <- iris
bar <- mtcars
mem_used() # Loading the datasets into objects requires some memory
#> 36.4 MB
foo["Foo"] <- mean(foo$Sepal.Length) + median(bar$carb)
mem_used() # Modifying requires some more memory
#> 36.6 MB
foo_list <- list(foo)
mem_used() # Adding to the list doesn't really (it's a few bytes)
#> 36.6 MB

Created on 2018-08-03 by the reprex package (v0.2.0).
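
Since the question uses data.table, the same point can be checked more directly by comparing memory addresses instead of total memory used. This is a minimal sketch, assuming data.table's address() helper and the built-in iris data standing in for the real CSV files. The names same_object and column_visible are made up for the illustration.

```r
library(data.table)

# Stand-in for fread("foo.csv"), as in the question
foo <- data.table(iris)

# Putting the table into a list stores a reference to the same object,
# so both names should report the same memory address.
datasets <- list(foo = foo)
same_object <- identical(address(foo), address(datasets$foo))
same_object
#> [1] TRUE

# `:=` adds the column by reference, so the list element sees it as well.
foo[, Foo := mean(Sepal.Length)]
column_visible <- "Foo" %in% names(datasets$foo)
column_visible
#> [1] TRUE
```

Note the side effect this implies: because the list element and foo are the same object, any later := on one is visible through the other. If an independent copy is ever needed, data.table provides copy() for that, at the cost of the extra memory.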

Calum You