
For reshaping data from wide to long, I've generally found melt.data.table from the data.table package to be the most memory-efficient method; online comparisons tend to favour it over the dplyr and base approaches in terms of memory. However, I've found myself unable to melt an object of about 11GB, which fails with the message:

Error: cannot allocate vector of size 10.5 Gb

My Windows computer has 32GB of RAM and I'm running 64-bit R. My melt command, shown below with a 5-row version of the dataset, works on that sample. The actual data has nearly 200 million rows.

library(data.table)
test <- structure(list(time = structure(c(667150201.25, 667153801.25, 667157401.25, 667161001.25, 667164601.25), tzone = "UTC", class = c("POSIXct", "POSIXt")),
                       red = c(-2.25, -2.375, -2.5, -0.5, -1.625),
                       orange = c(1.625, 1.375, 1.625, 2.25, 2.5),
                       yellow = c(1.25, 0.5, 1.5, 1.5, 1.625),
                       green = c(2.875, 2.625, 2.5, 3.25, 3),
                       blue = c(4.75, 4.5, 4.75, 4.75, 5.125),
                       purple = c(0.125, -0.125, 0.5, 1.25, 1.375),
                       violet = c(3.125, 2.875, 3.125, 3.375, 3.375),
                       pink = c(3.75, 1.75, 1.5, 1, 0.5)),
                       row.names = c(NA, -5L), class = c("data.table","data.frame"))
melt(test, id.vars='time',
  measure.vars=c('red','orange','yellow','green','blue','purple','violet','pink'),
  variable.name='color', value.name="value")

Is there a more memory-efficient way to convert the data from wide to long? The goal is a dataset with three columns: time, color, and value.

MBorg
  • Related [What methods can we use to reshape VERY large data sets?](https://stackoverflow.com/questions/55077668/what-methods-can-we-use-to-reshape-very-large-data-sets) – Henrik Nov 01 '21 at 10:29
  • I'll guess that the `object.size(realdata)` is roughly `1.44e10` (14.4GB). Also inferring that any reshaping operation will effectively duplicate the data, that brings you to nearly 29GB, which isn't going to work well once you account for OS overhead, services, your editor, other R objects, etc. My only suggestion for reshaping this is to break it up into chunks, reshape each of them, and recombine later; each step likely needs to be done in isolation from the others, meaning writing intermediate melts to disk and then `gc()`ing (a sketch of that idea follows these comments). Just a thought, no testing, sorry. – r2evans Nov 01 '21 at 11:54
  • Or just buy more RAM. – r2evans Nov 01 '21 at 11:55
  • When I convert it with `pivot_longer` it allocates 3.7 KB, versus 17.1 KB with your melt approach, but it takes significantly longer to run (70 µs for melt vs 1.1 ms for pivot_longer). – Donald Seinen Nov 01 '21 at 12:30
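
A minimal sketch of the chunk-and-write-to-disk idea from r2evans's comment, assuming the full wide table is an in-memory data.table called DT; DT, chunk_size and the temp files are illustrative names, and this is untested on the real data:

library(data.table)

chunk_size <- 10000000L
starts <- seq(1L, nrow(DT), by = chunk_size)
files <- character(length(starts))
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1L, nrow(DT))
  part <- melt(DT[rows], id.vars = 'time',
               variable.name = 'color', value.name = 'value')
  files[i] <- tempfile(fileext = '.csv')
  fwrite(part, files[i]) # park the long chunk on disk
  rm(part); gc()         # release memory before the next chunk
}
# recombine later, ideally in a fresh session (this still needs room for the full long result):
# long <- rbindlist(lapply(files, fread))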

2 Answers


If memory is the issue then taking smaller steps should solve it:

mv <- c('red','orange','yellow','green','blue','purple','violet','pink')
OUT <- data.table(color = character(0L), value = numeric(0L))
for (m in mv) {
  # stack one colour at a time instead of melting everything at once
  OUT <- rbind(OUT, test[, .(color = m, value = get(m))])
  set(test, j = m, value = NULL) # delete the wide column since it is no longer needed
}
# every colour block keeps the original row order, so recycling the time
# column once per block lines it up correctly
OUT[, time := rep(test$time, .N / nrow(test))]
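
The point of the set(test, j = m, value = NULL) call is that the wide data shrinks as the long table grows, so the two never have to coexist in full; note, though, that rbind() still copies the accumulated OUT on every pass, which is presumably what the allocation error reported in the comments below comes from.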

Another option to try is RSQLite, which writes the data straight to disk instead of holding it all in memory:

library(RSQLite)
library(data.table)

db_path <- 'C:/temp/temp.db'
test[, time := as.character(time)] # SQLite has no native datetime type
con <- dbConnect(SQLite(), db_path)
mv <- c('red','orange','yellow','green','blue','purple','violet','pink')
OUT <- data.table(time = character(0L), color = character(0L), value = numeric(0L))
dbWriteTable(con, "long", OUT) # create the empty target table
for (m in mv) {
  dbWriteTable(
    con, "long",
    test[, .(time = time, color = m, value = get(m))],
    append = TRUE
  )
}
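
Before moving on it may be worth sanity-checking the write and closing the connection (standard DBI calls; the expected count is an assumption based on the loop above):

dbGetQuery(con, "SELECT COUNT(*) AS n FROM long") # should equal nrow(test) * length(mv)
dbDisconnect(con)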

Now in a fresh session you can read it in steps:

library(RSQLite)
library(data.table)

db_path <- 'C:/temp/temp.db'
con <- dbConnect(SQLite(), db_path) # reconnect in the fresh session
step_size <- 20000L
IN <- data.table(time = character(0L), color = character(0L), value = numeric(0L))
rs <- dbSendQuery(con, "SELECT * FROM long")
while (!dbHasCompleted(rs)) {
  IN <- rbind(IN, dbFetch(rs, n = step_size))
}
dbClearResult(rs)
dbDisconnect(con)
s_baldur
  • The loop resulted in `Error: cannot allocate vector of size 6.6 Gb`. It's an improvement, but not much of one. I have room for an extra 6.6 Gb, though I suspect the loop process resulted in multiple objects of that size – MBorg Nov 04 '21 at 07:54
  • Oh :/. How does the data come about? Do you read it from CSV or database? Then there are opportunities there. – s_baldur Nov 04 '21 at 08:31
  • I saved it as an R object which can be loaded. It exceeds the max number of rows for a CSV file, though it could be converted to a txt file – MBorg Nov 04 '21 at 08:35
  • I am not familiar with RSQLite, though it does sound like an interesting idea – MBorg Nov 04 '21 at 08:43
  • I could create the SQLite object, but then getting all the rows was too much memory, in addition to taking long for R to process. Getting the first 100 rows was easy though. – MBorg Nov 04 '21 at 10:41
  • Ok, so you are closer. You could read in the rows in a stepwise manner as well that should solve it. Will post an update in 1min. – s_baldur Nov 04 '21 at 10:51
  • @MBorg I'm very curious to hear if this works (maybe `rbind()` will run out of memory?). If it doesn't, then I can think of another trick: pre-allocating the whole data.table and then using `set()` to fill it (sketched after this thread). – s_baldur Nov 04 '21 at 11:13
  • I tried this. After multiple days of running the while loop, my program finally crashed with no progress made :( – MBorg Nov 11 '21 at 03:00
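
A minimal sketch of the pre-allocate-then-set() trick s_baldur mentions above, reusing test and mv from the answer and assuming the wide columns are still present; the block indexing also assumes the total long row count stays under R's integer limit, and it is untested on the full data:

library(data.table)

mv <- c('red','orange','yellow','green','blue','purple','violet','pink')
n  <- nrow(test)
# allocate the full long table once ...
OUT <- data.table(time  = rep(test$time, times = length(mv)),
                  color = rep(mv, each = n),
                  value = numeric(n * length(mv)))
# ... then fill the value column block by block without copying OUT
for (i in seq_along(mv)) {
  set(OUT, i = ((i - 1L) * n + 1L):(i * n), j = "value", value = test[[mv[i]]])
  set(test, j = mv[i], value = NULL) # free the wide column as we go
}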

Perhaps an approach like this could work? I have no idea how to measure the actual memory usage, though.

library(data.table)
# one single-column data.table per colour
test.split <- split.default(test[, -1], names(test)[-1])
# attach time to each piece, stack the pieces, and record each piece's name in 'color'
out <- data.table::rbindlist(lapply(test.split, cbind, test$time), use.names = FALSE, idcol = "color")
setnames(out, c("color", "value", "time"))
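
Note that the final rbindlist() call still has to materialise the complete long result in memory at once, on top of the split copy of the wide data, which is presumably why the allocation error reported in the comments below appears at that step rather than at the split.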
Wimpel
  • The test.split section works, but then I get `Error: cannot allocate vector of size 1.3 Gb` at the lapply step. I can fit an extra 1.3 Gb, but I think the iterative loop process causes it to eventually exceed capacity. Perhaps another method instead of lapply may be more efficient? – MBorg Nov 04 '21 at 08:34
  • try running the `lapply(...)` on the first n chunks... like `lapply(test.split[1:n], ...)` and see how many you can use in one go ;-) (a sketch of that chunked variant follows) – Wimpel Nov 04 '21 at 09:02
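
A rough sketch of that chunked variant, writing each piece to disk so it can be recombined later; the chunk size, temp files and recombine step are illustrative and untested on the full data:

library(data.table)

test.split <- split.default(test[, -1], names(test)[-1])
groups <- split(names(test.split), ceiling(seq_along(test.split) / 2)) # 2 colours per chunk, adjust to taste
files <- character(0L)
for (g in groups) {
  part <- rbindlist(lapply(test.split[g], cbind, test$time),
                    use.names = FALSE, idcol = "color")
  setnames(part, c("color", "value", "time"))
  f <- tempfile(fileext = ".csv")
  fwrite(part, f)       # park this chunk on disk
  files <- c(files, f)
  test.split[g] <- NULL # drop the processed pieces to free memory
  rm(part); gc()
}
# recombine later (still needs room for the full long result):
# long <- rbindlist(lapply(files, fread))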