
I have a data.table similar to the one below, but with around 3 million rows and a lot more columns.

   key1 price qty status category
 1:    1  9.26   3      5        B
 2:    1 14.64   1      5        B
 3:    1 16.66   3      5        A
 4:    1 18.27   1      5        A
 5:    2  2.48   1      7        A
 6:    2  0.15   2      7        C
 7:    2  6.29   1      7        B
 8:    3  7.06   1      2        A
 9:    3 24.42   1      2        A
10:    3  9.16   2      2        C
11:    3 32.21   2      2        B
12:    4 20.00   2      9        B

Here's the dput() output:

dados = structure(list(key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4), 
    price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06, 
    24.42, 9.16, 32.21, 20), qty = c(3, 1, 3, 1, 1, 2, 1, 1, 
    1, 2, 2, 2), status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2, 
    9), category = c("B", "B", "A", "A", "A", "C", "B", "A", 
    "A", "C", "B", "B")), .Names = c("key1", "price", "qty", 
"status", "category"), row.names = c(NA, -12L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x0000000004720788>)

I need to transform this data so that I have one row for each key, and in the process I need to create some additional variables. So far I have been using this:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
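# e.g. Mode(c(5, 5, 7)) returns 5, the most frequent value in x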


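# Build the one-row summary for a single key's subset of rows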
key.aggregate = function(x){
  return(data.table(
    key1 = Mode(x$key1),
    perc.A = sum(x$price[x$category == "A"],na.rm=T)/sum(x$price),
    perc.B = sum(x$price[x$category == "B"],na.rm=T)/sum(x$price),
    perc.C = sum(x$price[x$category == "C"],na.rm=T)/sum(x$price),
    status = Mode(x$status),
    qty = sum(x$qty),
    price = sum(x$price)
  ))
}

new_data = split(dados, by = "key1") # Runs out of RAM here
results = rbindlist(lapply(new_data,key.aggregate))

And I'm expecting the following output:

> results
   key1    perc.A    perc.B     perc.C status qty price
1:    1 0.5937447 0.4062553 0.00000000      5   8 58.83
2:    2 0.2780269 0.7051570 0.01681614      7   4  8.92
3:    3 0.4321208 0.4421414 0.12573782      2   6 72.85
4:    4 0.0000000 1.0000000 0.00000000      9   2 20.00

But I always run out of RAM when splitting the data by key. I've tried using only a third of the data, and then only a sixth of it, but it still gives the same `Error: cannot allocate vector of size 593 Kb`.

I'm thinking this approach is very inefficient. What would be the best way to get this result?
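
For reference, here is a sketch of the direction I suspect I should be taking instead: doing the whole aggregation in a single grouped `j` expression, so the table is never split into a list. I haven't confirmed it behaves the same on the full 3-million-row table:

results = dados[, .(
  perc.A = sum(price[category == "A"], na.rm = TRUE) / sum(price),
  perc.B = sum(price[category == "B"], na.rm = TRUE) / sum(price),
  perc.C = sum(price[category == "C"], na.rm = TRUE) / sum(price),
  status = Mode(status),
  qty = sum(qty),
  price = sum(price)
), by = key1]   # one pass over the data, grouped by key1, no intermediate list of subsets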

  • There is definitely a more efficient way of doing this, but if you don't even have 593 Kb of memory available I'm thinking there's a bigger issue with your machine. – IceCreamToucan Nov 01 '18 at 17:48
  • I thought it was weird that the error points to 593Kb, this machine has 8GB and while running the script it gets to 100% of it. Could you share the direction of a more efficient way? – Fino Nov 01 '18 at 17:52
  • It may be better to use `dcast` after creating a column `dados[, perc := .N/sum(price), .(key1, category)]` – akrun Nov 01 '18 at 17:55
  • I suggest you investigate the memory problem, as that's not insignificant (*"640K ought to be enough for anyone"*, but you don't even have that). Big culprits: docker or other VM-based services; database servers; big xlsx/accdb open; bitcoin malware (?); other long-running processes (with or without memory leaks). – r2evans Nov 01 '18 at 17:57
  • Thanks @r2evans, I'll look into that. It's a work machine, so there's a lot of audit stuff running in the background that could be causing it. – Fino Nov 01 '18 at 18:03
  • It stinks, but this just might give you something ... reboot it? – r2evans Nov 01 '18 at 18:15
  • That error message doesn't mean that less than 593 Kb of memory is available. It just means that R can't allocate an *additional* 593 Kb of memory. See https://stackoverflow.com/questions/8920722/cannot-allocate-vector-in-r-despite-being-in-64-bit-version – MrFlick Nov 01 '18 at 19:59

0 Answers