Keyed assignment should save memory.
dt1[dt2, on = "id", x5 := x5]
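To make the mechanics concrete, here is a minimal sketch on two toy tables. The names `small1` and `small2` and their contents are made up for illustration and are not the benchmark tables used later. It writes `i.x5` instead of plain `x5` only to make explicit that the value comes from the table inside the brackets.

```r
library(data.table)

# Hypothetical toy tables, only for illustration
small1 <- data.table(id = c(1L, 2L, 2L, 3L), x1 = c(0.1, 0.2, 0.3, 0.4))
small2 <- data.table(id = 1:3, x5 = c(11, 22, 33))

# For every row of small1, find the row of small2 with the same id and
# write its x5 into a new column of small1. `:=` adds the column by
# reference, so small1 is not rebuilt or reassigned.
small1[small2, on = "id", x5 := i.x5]

small1$x5
# 11 22 22 33
```

Note that there is no `small1 <- ...` on the left: the table is modified in place, which is where the memory saving should come from.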
Should we use a DB library to get this done?
That's probably a good idea. If setting up and using a database is painful for you, try the RSQLite
package. It's pretty simple.
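As a rough sketch of what that could look like, assuming the DBI and RSQLite packages are installed: the snippet below loads two toy tables into an in-memory SQLite database and fills the new column with an `UPDATE`. The table contents are made up, and a real setup would probably use a file-backed database and an index on `id`.

```r
library(DBI)

# In-memory database; pass a file path instead to keep the data on disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy tables standing in for dt1 and dt2
dbWriteTable(con, "dt1", data.frame(id = c(1L, 2L, 2L, 3L),
                                    x1 = c(0.1, 0.2, 0.3, 0.4)))
dbWriteTable(con, "dt2", data.frame(id = 1:3, x5 = c(11, 22, 33)))

# Add the new column, then fill it from dt2 by matching on id
dbExecute(con, "ALTER TABLE dt1 ADD COLUMN x5 REAL")
dbExecute(con, "UPDATE dt1 SET x5 = (SELECT x5 FROM dt2 WHERE dt2.id = dt1.id)")

result <- dbGetQuery(con, "SELECT id, x5 FROM dt1 ORDER BY id")
dbDisconnect(con)
```

The update happens inside the database, so R only holds whatever you later query back out.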
My experiment
tl;dr: 55% less memory allocated by keyed assignment compared to merge-and-replace, for a toy example.
I wrote two scripts that each sourced a setup script, dt-setup.R, to create dt1 and dt2. The first script, dt-merge.R, updated dt1 through the "merge" method. The second, dt-keyed-assign.R, used keyed assignment. Both scripts recorded memory allocations using the Rprofmem() function.
To avoid torturing my laptop, I made dt1 500,000 rows and dt2 3,000 rows.
Scripts:
# dt-setup.R
library(data.table)
set.seed(9474)
id_space <- seq_len(3000)
dt1 <- data.table(
  id = sample(id_space, 500000, replace = TRUE),
  x1 = runif(500000),
  x2 = runif(500000),
  x3 = runif(500000),
  x4 = runif(500000)
)
dt2 <- data.table(
  id = id_space,
  x5 = 11 * id_space
)
setkey(dt1, id)
setkey(dt2, id)
# dt-merge.R
source("dt-setup.R")
Rprofmem(filename = "dt-merge.out")
dt1 <- dt2[dt1, on = "id"]
Rprofmem(NULL)
# dt-keyed-assign.R
source("dt-setup.R")
Rprofmem(filename = "dt-keyed-assign.out")
dt1[dt2, on = "id", x5 := x5]
Rprofmem(NULL)
With all three scripts in my working directory, I ran each of the joining scripts in a separate R process.
system2("Rscript", "dt-merge.R")
system2("Rscript", "dt-keyed-assign.R")
I think the lines in the output files generally follow the pattern "<bytes> :<call stack>". I haven't found good documentation for this. However, the numbers at the front were never below 128, which matches the default size threshold: vectors smaller than that are not malloc'ed individually by R.
Note that not all of these allocations add to the total memory used by R. R might reuse some memory it already has after a garbage collection. So the sum of allocations is not a good way to measure how much memory is in use at any specific time. However, if we assume garbage collection behavior is independent of which script is running, it does work as a comparison between scripts.
Some sample lines of the memory report:
cat(readLines("dt-merge.out", 5), sep = "\n")
# 90208 :"get" "["
# 528448 :"get" "["
# 528448 :"get" "["
# 1072 :"get" "["
# 20608 :"get" "["
There are also lines like new page:"get" "[" for page allocations.
Luckily, these are simple to parse.
parse_memory_report <- function(path) {
  report <- readLines(path)
  # Page allocations are reported as "new page:<call stack>"
  new_pages <- startsWith(report, "new page:")
  # Everything else starts with the number of bytes malloced
  allocations <- as.numeric(gsub(":.*", "", report[!new_pages]))
  total_malloced <- sum(allocations)
  message(
    "Summary of ", path, ":\n",
    sum(new_pages), " new pages allocated\n",
    total_malloced, " bytes malloced"
  )
}
parse_memory_report("dt-merge.out")
# Summary of dt-merge.out:
# 12 new pages allocated
# 32098912 bytes malloced
parse_memory_report("dt-keyed-assign.out")
# Summary of dt-keyed-assign.out:
# 13 new pages allocated
# 14284272 bytes malloced
I got exactly the same results when repeating the experiment.
So keyed assignment has one more page allocation. The default byte size for a page is 2000. I'm not sure how malloc works, and 2000 is tiny relative to all the allocations, so I'll ignore this difference. Please chastise me if this is dumb.
So, ignoring pages, keyed assignment allocated 55% less memory than the merge.
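For anyone checking the arithmetic, the percentage comes straight from the two "bytes malloced" totals reported above. The variable names here are just for illustration.

```r
# Totals copied from the two parse_memory_report() summaries
merge_bytes <- 32098912
keyed_bytes <- 14284272

# Fraction of the merge's allocations that keyed assignment avoided
reduction <- 1 - keyed_bytes / merge_bytes
round(100 * reduction, 1)
# 55.5
```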