Efficiently merging large data.tables

Question

I have two fairly large data.table objects that I want to merge.

dt1 has 500.000.000 observations on 5 columns.
dt2 has 300.000 observations on 2 columns.

Both objects have the same key called id.

I want to left_join information from dt2 into dt1.

For example:

dt1  <- data.table(id = c(1, 2, 3, 4),
               x1 = c(12, 13, 14, 15),
               x2 = c(5, 6, 7, 8),
               x3 = c(33, 44, 55, 66),
               x4 = c(123, 123, 123, 123))

dt2 <- data.table(id = c(1, 2, 3, 4),
              x5 = c(555, 666, 777, 888))
setkey(dt1, id)
setkey(dt2, id)

dt2[dt1, on="id"] 

> dt2[dt1, on="id"]
   id  x5 x1 x2 x3  x4
1:  1 555 12  5 33 123
2:  2 666 13  6 44 123
3:  3 777 14  7 55 123
4:  4 888 15  8 66 123

However, when merging my original data R can't allocate memory anymore. Yet, the output of the merge fits in the RAM.

What is the most efficient (speed vs. memory limitations) way of getting this large merge done?

Should we split-apply-combine?

Should we use a DB library to get this done?

How would you do this efficiently?

What specific memory problems do you get on the full dataset? Do you think the merged table will fit in your RAM, and R is just using too much memory during the merge? — Marius, May 07 '19 at 00:40
@Marius : The merged table will fit easily in my RAM. But during the merge, it can't allocate anymore. — wake_wake, May 07 '19 at 01:03
Possible duplicate: [*Left join using data.table*](https://stackoverflow.com/q/34598139/2204410) — Jaap, May 10 '19 at 15:54
Not really a duplicate of the above mentioned as OP is looking for ways to operate a left join with large datasets under resource constraints — zonfl, May 10 '19 at 16:33

Nathan Werth · Accepted Answer · 2019-05-09T19:27:39.013

Keyed assignment should save memory.

dt1[dt2, on = "id", x5 := x5]

Should we use a DB library to get this done?

That's probably a good idea. If setting up and using a database is painful for you, try the RSQLite package. It's pretty simple.

My experiment

tl;dr: 55% less memory used by keyed assignment compared to merge-and-replace, for a toy example.

I wrote two scripts that each sourced a setup script, dt-setup.R to create dt1 and dt2. The first script, dt-merge.R, updated dt1 through the "merge" method. The second, dt-keyed-assign.R, used keyed assignment. Both scripts recorded memory allocations using the Rprofmem() function.

To not torture my laptop, I'm having dt1 be 500,000 rows and dt2 3,000 rows.

Scripts:

# dt-setup.R
library(data.table)

set.seed(9474)
id_space <- seq_len(3000)
dt1  <- data.table(
  id = sample(id_space, 500000, replace = TRUE),
  x1 = runif(500000),
  x2 = runif(500000),
  x3 = runif(500000),
  x4 = runif(500000)
)
dt2 <- data.table(
  id = id_space,
  x5 = 11 * id_space
)
setkey(dt1, id)
setkey(dt2, id)

# dt-merge.R
source("dt-setup.R")
Rprofmem(filename = "dt-merge.out")
dt1 <- dt2[dt1, on = "id"]
Rprofmem(NULL)

# dt-keyed-assign.R
source("dt-setup.R")
Rprofmem(filename = "dt-keyed-assign.out")
dt1[dt2, on = "id", x5 := x5]
Rprofmem(NULL)

With all three scripts in my working directory, I ran each of the joining scripts in a separate R process.

system2("Rscript", "dt-merge.R")
system2("Rscript", "dt-keyed-assign.R")

I think the lines in the output files generally follow the pattern "<bytes> :<call stack>". I haven't found good documentation for this. However, the numbers in the front were never below 128, and this is the default minimum number of bytes below which R does not malloc for vectors.

Note that not all of these allocations add to the total memory used by R. R might reuse some memory it already has after a garbage collection. So it's not a good way to measure how much memory is used at any specific time. However, if we assume garbage collection behavior is independent, it does work as a comparison between scripts.

Some sample lines of the memory report:

cat(readLines("dt-merge.out", 5), sep = "\n")
# 90208 :"get" "[" 
# 528448 :"get" "[" 
# 528448 :"get" "[" 
# 1072 :"get" "[" 
# 20608 :"get" "["

There are also lines like new page:"get" "[" for page allocations.

Luckily, these are simple to parse.

parse_memory_report <- function(path) {
  report <- readLines(path)
  new_pages <- startsWith(report, "new page:")
  allocations <- as.numeric(gsub(":.*", "", report[!new_pages]))
  total_malloced <- sum(as.numeric(allocations))
  message(
    "Summary of ", path, ":\n",
    sum(new_pages), " new pages allocated\n",
    sum(as.numeric(allocations)), " bytes malloced"
  )
}

parse_memory_report("dt-merge.out")
# Summary of dt-merge.out:
# 12 new pages allocated
# 32098912 bytes malloced

parse_memory_report("dt-keyed-assign.out")
# Summary of dt-keyed-assign.out:
# 13 new pages allocated
# 14284272 bytes malloced

I got exactly the same results when repeating the experiment.

So keyed assignment has one more page allocation. The default byte size for a page is 2000. I'm not sure how malloc works, and 2000 is tiny relative to all the allocations, so I'll ignore this difference. Please chastise me if this is dumb.

So, ignoring pages, keyed assignment allocated 55% less memory than the merge.

That's really interesting. Would love to hear if this solves OPs problem as I've run into this issue a million times but somehow never bothered to set keys. — zonfl, May 10 '19 at 16:30
This solution seems to work well on a small subset of the data, but still runs into troubles on my 64GB RAM laptop. I am leaving this question open for others to suggest their solution. — wake_wake, May 11 '19 at 21:30
Nice. Yes, assignment on join don't need to create new data.table as result, while the first method creates fully new DT. This is AFAIK most memory efficient. Other trick that could be employed is to ensure smaller columns are being used if possible: int instead of double will reduce memory requirements by factor of 2. Changing platform to Linux might eventually help too. — jangorecki, May 12 '19 at 04:55
@wake_wake test that code snippet I posted. It should work (at least it always worked for me), but can't tell you how fast (or better slow) it's gonna be. And if it works try it again with mclapply on a subset of the data to see if it speeds up computations (I assume it will because the permutations are independent of each other). Also note that, at least from my experience, a DB with such dimensions probably won't allow you to perform these operations quicker if you don't have access to a performant cluster. — zonfl, May 15 '19 at 14:17

score 5 · Answer 2 · answered May 07 '19 at 18:36

If you must go for the split-merge approach and the following operation works with your memory, be sure to preallocate as much as possible in order to make the iterations faster. So something like this was the most efficient solution I could come up with when dealing with a similar problem:

dt1  <- data.table(id = c(1, 2, 3, 4),
                   x1 = c(12, 13, 14, 15),
                   x2 = c(5, 6, 7, 8),
                   x3 = c(33, 44, 55, 66),
                   x4 = c(123, 123, 123, 123))

dt2 <- data.table(id = c(1, 2, 3, 4),
                  x5 = c(555, 666, 777, 888))

dt1_id <- sort(unique(dt1$id)) # extract all ids that are in dt1
dt1_l_split <- length(dt1_id) # get number of iterations
dt2_l_split <- length(unique(dt2[id %in% dt1_id]$id))

split_dt1 <- vector(mode = "list", length = length(unique(dt1$id))) # preallocate vector
split_dt1 <- lapply(1:dt1_l_split, function(x) dt1[id %in% dt1_id[[x]]]) # fill list with splits

rm(dt1); gc() # remove the large data table to save memory and clean up RAM

dt1 <- lapply(1:dt1_l_split, function(i) {
  print(Sys.time())
  print(i)

  tmp <- dt2[id %in% dt1_id[[i]]] # load relevant parts from dt2
  merge(tmp, split_dt1[[i]], all = TRUE) # merge dt1 and dt2
})
rbindlist(dt1)

You could try to use mclapply from the parallel package to speed up your computations, I've had mixed result though, sometimes it would really speed things up, sometimes it would be slower, so I guess it' best to try that out.

Alternatively (and imo the easiest solution) just push the project into your Dropbox/Google Drive/Whatever cloud you prefer and set up a Google Cloud VM with 52GB RAM, a few CPUs, and Windows Server (yikes, but no need to set up a GUI, etc. yourself). Took me ~ 10 minutes to set everything up and you get a budget of $300 for the first year, which makes it basically free.

Efficiently merging large data.tables

2 Answers2

Linked