Problem
There is a 1 GB CSV file containing 3.3 million rows and 40 columns:
DT
   ID a  b  c  d ...
1:  b 1  7 13 19
2:  b 2  8 14 20
3:  b 3  9 15 21
4:  a 4 10 16 22
5:  a 5 11 17 23
6:  c 6 12 18 24
...
I need to collapse the rows by one of the columns (let's call it ID):
outDT
   ID a  b        c        d ...
1:  b 3  9 13,14,15 19,20,21
2:  a 5 11    16,17    22,23
3:  c 6 12       18       24
...
Some columns can hold only a single value per ID (the value from the last row in the group overrides the earlier ones). The remaining columns are multi-value (their cells should be collected into lists).
What I have tried
The following snippet works with small data sets:
library("data.table")
DT <- data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18, d = 19:24)
DT
#   ID a  b  c  d
#1:  b 1  7 13 19
#2:  b 2  8 14 20
#3:  b 3  9 15 21
#4:  a 4 10 16 22
#5:  a 5 11 17 23
#6:  c 6 12 18 24
id <- "ID"
overwriteCols <- c("a", "b")
groupCols <- names(DT)[!names(DT) %in% c(overwriteCols, id)]
DTo <- DT[, lapply(.SD, last), by = id, .SDcols = overwriteCols]
DTg <- DT[, lapply(.SD, (function(x) x = list(x))), by = id, .SDcols = groupCols]
outDT <- DTo[DTg, on = id]
outDT
#   ID a  b        c        d
#1:  b 3  9 13,14,15 19,20,21
#2:  a 5 11    16,17    22,23
#3:  c 6 12       18       24
but it runs out of memory when the sample data table is replaced with the real one:
id <- "PROVIDER_ID"
dropCols <- c(
# ...
)
DT <- fread(path,
header = T,
sep="|",
stringsAsFactors = T,
blank.lines.skip = T,
nThread = 3,
fill = T,
drop = dropCols,
key = c("PROVIDER_ID"))
The operation on groupCols is particularly demanding. I tried chaining the operations on overwriteCols and groupCols, but the grouping seems to be too heavy on its own: the script hits the 4 GB memory limit on my laptop and fails.
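For reference, the chained form I mean looks roughly like this (a sketch built on the toy columns above, not my exact production code):

# both aggregations and the join written as one chained expression,
# avoiding the named intermediate tables DTo and DTg
outDT <- DT[, lapply(.SD, last), by = id, .SDcols = overwriteCols][
  DT[, lapply(.SD, function(x) list(x)), by = id, .SDcols = groupCols],
  on = id]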
I thought of chunked processing, but I couldn't come up with any working solution.
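The rough idea was something like the following (a sketch only; chunkSize is a placeholder, allCols stands for the full vector of column names, and handling IDs whose rows span chunk boundaries is exactly the part I could not solve):

nTotal    <- 3.3e6               # total number of data rows (from the problem statement)
chunkSize <- 500000L             # placeholder chunk size
nChunks   <- ceiling(nTotal / chunkSize)
parts <- vector("list", nChunks)
for (i in seq_len(nChunks)) {
  # skip the header plus all previously read rows, then read one block
  chunk <- fread(path, sep = "|", header = FALSE, col.names = allCols,
                 skip = (i - 1L) * chunkSize + 1L, nrows = chunkSize)
  # ... aggregate `chunk` here exactly as in the small-data snippet ...
  parts[[i]] <- chunk
}
# unsolved: rows sharing an ID can land in different chunks, so the
# per-chunk results would still need to be re-aggregated (and the list
# columns concatenated) after rbindlist(parts)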
Question
How can I make my code more memory-efficient?