
I have a large data.table which I need to subset, sum, and group in the same way at several points in my code. Therefore, I store the result to save time. The operation still takes rather long, and I would like to know how to speed it up.

inco <- inventory[period > p, sum(incoming), by = articleID][,V1]

The keys of inventory are period and articleID. The size varies depending on the parameters but is always greater than 3 GB; it has about 62,670,000 rows and 7 variables.

My thoughts so far:

1. Subset: period > p

This might be faster with a keyed subset (binary search) instead of the current vector scan, but I would need to generate the sequence of periods from p+1 to max(period) for that, which takes additional time. Plus, the data.table is already sorted by period, so I suppose the speed gain would be small (see the sketch after this list).

2. Aggregate: sum(incoming)

No idea how to improve this.

3. Group: by = articleID

This grouping might be faster with a different key order on the table, but that would have a bad impact on my other code.

4. Access: [, V1]

This could be deferred to later operations, but I doubt it would yield a speed gain.
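
For point 1, this is roughly the keyed alternative I have in mind (a minimal sketch; inco_keyed is just a placeholder name, and nomatch = 0L drops periods that do not occur in the table):

# join on the period key (binary search) instead of scanning period > p
periods <- (p + 1):inventory[, max(period)]
inco_keyed <- inventory[.(periods), sum(incoming), by = articleID, nomatch = 0L][, V1]

Whether this join actually beats the vector scan probably depends on how large p is relative to max(period).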

Do you have ideas for detailed profiling or improving this operation?

Minimal reproducible example (decrease n to make it run on your machine, if necessary):

library(data.table)
p <- 100
n <- 10000
inventory <- CJ(period = seq(1, n, 1), weight = c(0.1, 1), volume = c(1, 10), price = c(1, 1000),
                E_demand = c(1000), VK = seq(from = 0.2, to = 0.8, by = 0.2),
                s = c(seq(1, 99, 1), seq(from = 100, to = 1000, by = 20)))
inventory[, articleID := paste0("W", weight, "V", volume, "P", price, "E", round(E_demand, 2), "VK", round(VK, 3), "s", s)]
inventory[, incoming := rgamma(nrow(inventory), shape = 0.3, rate = 1)]
setkey(inventory, period, articleID)
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
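
To see where the time goes on the example data, a simple timing plus data.table's own diagnostics can be used (a minimal sketch; system.time() is base R, and verbose = TRUE prints how the subset and grouping are executed, e.g. whether GForce optimises the sum). More detailed profiling, for example with Rprof(), would follow the same pattern.

# rough wall-clock timing of the single line
system.time(
  inco <- inventory[period > p, sum(incoming), by = articleID][, V1]
)
# same call with internal diagnostics printed by data.table
tmp <- inventory[period > p, sum(incoming), by = articleID, verbose = TRUE]
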
    Without a reproducible example it will be difficult to make recommendations for improvement. Also, how large is your data roughly in terms of number of columns, rows and distinct articleIDs? – talat Dec 01 '15 at 08:16
  • Without further knowledge of your project my comment is very general. However, you state that you need to repeat the same operation in the same way multiple times. That leads me to wonder whether the architecture and design of your project could be optimized. Also, with all things regarding R and speed, have you considered Revolution R Open? Depending on your needs, it can be a game-changer: https://mran.revolutionanalytics.com/documents/rro/multithread/#mt-bench –  Dec 01 '15 at 08:57
  • Parallelization might bring speed benefits, but I also have RAM as a limiting factor, so I am not sure how this will end up. – mondano Dec 01 '15 at 10:01
  • The subsetting (vector scan) takes 3 seconds on my system, the whole line less than 4 seconds. Can you please tell us what kind of speed-up you'd need? – Roland Dec 01 '15 at 12:34
  • I am doing a simulation and this is part of a for loop which is run about 10,000 times. The single operation is of course not very time-consuming, but as the loop runs 10,000 times, a decrease of a second would help a lot. – mondano Dec 01 '15 at 13:27
  • Are you sure you need to do this with 63 million observations? Can't you downsample this? – Roland Dec 01 '15 at 14:32
  • How on earth did you get the idea that giving code that constructs a *data.table* of 46 million rows qualifies as a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610)? – Jaap Dec 01 '15 at 20:37
