
Problem

There is a 1 GB CSV file containing 3.3 million rows and 40 columns:

DT
   ID a  b  c  d ...
1:  b 1  7 13 19
2:  b 2  8 14 20
3:  b 3  9 15 21
4:  a 4 10 16 22
5:  a 5 11 17 23
6:  c 6 12 18 24
...

I need to merge the rows by a column (let's call it ID):

outDT
   ID a  b        c        d ...
1:  b 3  9 13,14,15 19,20,21
2:  a 5 11    16,17    22,23
3:  c 6 12       18       24
...

Some columns can store only a single value (subsequent rows override the previous ones). The rest of the columns are multi-value (their cells should be grouped into lists).

What I have tried

The following snippet works with small data sets:

library("data.table")
DT <- data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18, d = 19:24)
DT
#   ID a  b  c  d
#1:  b 1  7 13 19
#2:  b 2  8 14 20
#3:  b 3  9 15 21
#4:  a 4 10 16 22
#5:  a 5 11 17 23
#6:  c 6 12 18 24

id <- "ID"
overwriteCols <- c("a", "b")
groupCols <- names(DT)[!names(DT) %in% c(overwriteCols, id)]

# single-value columns: keep the last value per ID
DTo <- DT[, lapply(.SD, last), by = id, .SDcols = overwriteCols]
# multi-value columns: collect all values per ID into a list
DTg <- DT[, lapply(.SD, (function(x) x = list(x))), by = id, .SDcols = groupCols]
# join the two aggregates back together on ID
outDT <- DTo[DTg, on = id]
outDT
#   ID a  b        c        d
#1:  b 3  9 13,14,15 19,20,21
#2:  a 5 11    16,17    22,23
#3:  c 6 12       18       24

but it runs out of memory when the sample data table is replaced with the real one:

id <- "PROVIDER_ID"
dropCols <- c(
  # ...
)
DT <- fread(path,
            header = TRUE,
            sep = "|",
            stringsAsFactors = TRUE,
            blank.lines.skip = TRUE,
            nThread = 3,
            fill = TRUE,
            drop = dropCols,
            key = c("PROVIDER_ID"))

The operation on groupCols is particularly demanding. I tried chaining the operations on overwriteCols and groupCols (roughly as in the sketch below), but the grouping step seems to be too heavy on its own. The script fails under the 4 GB memory limit on my laptop.
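
For reference, the chained attempt looked roughly like this (a simplified sketch; my actual version had a few extra steps):

# overwrite the single-value columns by reference (last value per ID),
# then collect the remaining columns into lists in one chained call
outDT <- DT[, (overwriteCols) := lapply(.SD, last), by = id, .SDcols = overwriteCols
            ][, lapply(.SD, function(x) list(x)), by = c(id, overwriteCols)]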

I thought of chunked processing, but I couldn't come up with any working solution.
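
The rough idea was something like the sketch below (chunkSize is just an illustrative value, and id/groupCols are as defined above; the part I couldn't solve is that the same ID can span several chunks, so the partial list columns still have to be concatenated per ID at the end, which brings back the same memory pressure):

chunkSize <- 500000L                                    # illustrative chunk size
allCols   <- names(fread(path, sep = "|", nrows = 0L))  # column names from the header
offset    <- 0L
partials  <- list()

repeat {
  # read the next chunk; header = FALSE because we skip past the header line
  chunk <- fread(path, sep = "|", header = FALSE, col.names = allCols,
                 skip = offset + 1L, nrows = chunkSize)
  # group the multi-value columns within this chunk only
  partials[[length(partials) + 1L]] <-
    chunk[, lapply(.SD, function(x) list(x)), by = id, .SDcols = groupCols]
  offset <- offset + chunkSize
  # stop after the last partial chunk (a file that is an exact multiple
  # of chunkSize would need extra handling)
  if (nrow(chunk) < chunkSize) break
}

# IDs split across chunk boundaries must be merged again,
# which is where the memory problem comes back
DTg <- rbindlist(partials)[, lapply(.SD, function(x) list(unlist(x))),
                           by = id, .SDcols = groupCols]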

Question

How can I make my code more memory-efficient?

  • How about storing `DTo` - after it's created - in a .rds file, removing DTo from memory, creating DTg, then removing DT from memory and afterwards combining DTo and DTg? – talat Jun 18 '18 at 10:48
  • You don't need to store `DTo` in the Global Environment (and thus memory). You can also do: `DT[, (overwriteCols) := lapply(.SD, last), by = id, .SDcols = overwriteCols][, lapply(.SD, function(x) x = list(x)), by = c(overwriteCols, id)]` – Jaap Jun 18 '18 at 10:49
  • How about doing something like this (combining csv files without loading them into memory?): https://stackoverflow.com/questions/37444120/how-to-combine-multiple-csv-files-into-one-big-file-without-loading-the-actual-f – DJV Jun 18 '18 at 10:51
  • @Jaap, thanks. Yes, I tried updating `overwriteCols` by reference, but my approach was slightly more complicated. Your code takes much longer (5 min) but still consumes more than 4G of RAM :-( – Ruslan Osmanov Jun 18 '18 at 11:02
  • @RuslanOsmanov I think that the main issue with R is that it uses your RAM (mainly if not only). Thus, I guess that the easier way (might not be the right way) is to write straight to the disk without loading the files to R. – DJV Jun 18 '18 at 11:06
  • @DJV it's not just a matter of combining csv files, though. – talat Jun 18 '18 at 11:11
  • @DJV, well, I need to process an existing CSV file in some way, and I have chosen R as a tool. Do you mean that R is not the right tool for this task? I don't quite understand you. Actually, I need to convert the CSV file to NDJSON. I've omitted that part from the question. – Ruslan Osmanov Jun 18 '18 at 11:11
  • @docendodiscimus I understand, but I think that it's the main issue, isn't it? (I might be wrong here.) – DJV Jun 18 '18 at 11:14
  • @RuslanOsmanov I would never say that R is the wrong tool ;). However, if you have an issue with the memory, I'm just thinking about a way to get past that. I would try to do my adjustments for each file, save them as `RDS`/`feather`, and then merge without loading them into memory. Again, I might be wrong here. I would love to hear a simpler/easier solution. – DJV Jun 18 '18 at 11:17
  • @DJV sorry but I guess you are indeed wrong here. OP is dealing with "There is a 1G CSV", no word about different files that need to be merged. It's about a specific sort of aggregating/reshaping of the data. – talat Jun 18 '18 at 11:20
  • @RuslanOsmanov, I wonder if you've tried to set keys on the data.table and just run the second part of your aggregation (`DTg <- DT[, lapply(.SD, (function(x) x = list(x))), by = id, .SDcols = groupCols]`). Does that work generally or does it stop due to memory issues? – talat Jun 18 '18 at 11:22
  • @docendodiscimus, something like `setkeyv(DTg, c("ID"))`? The script fails in the process of building `DTg`, so I don't even get a chance to set keys for `DTg`. – Ruslan Osmanov Jun 18 '18 at 11:37
  • @RuslanOsmanov Since you group by `id`, you should set `id` as the key of the original `DT` (and not on `DTg`). – nicola Jun 18 '18 at 11:45
  • @nicola, the key is set implicitly via the `fread` option: `key = c("PROVIDER_ID")`. I've just tried `setkeyv(DT, c("PROVIDER_ID"))` after `fread`. Unfortunately, it didn't help. Thanks. – Ruslan Osmanov Jun 18 '18 at 11:47
  • Try reading in only the ID + groupCols columns before creating DTg. Does that work? – talat Jun 18 '18 at 13:08
  • @docendodiscimus, no, even `DTg[, lapply(.SD, (function(x) x = list(unique(x)))), by = id, .SDcols = groupCols]` on its own tries to consume more than 4G (with `overwriteCols` added to `drop` list) – Ruslan Osmanov Jun 18 '18 at 14:25
  • Possible duplicate of [Evaluate expression given as a string](https://stackoverflow.com/questions/1743698/evaluate-expression-given-as-a-string) – Ameya Jun 18 '18 at 23:39
  • @Ameya my question doesn't duplicate this post. – Ruslan Osmanov Jun 19 '18 at 09:42
  • You're right. Sorry, I was marking it for another question. I must've mixed it up. – Ameya Jun 19 '18 at 10:26
