I have a large data.table which I need to subset, sum, and group in the same way in several places in my code, so I store the result to save time. The operation still takes rather long, and I would like to know how to speed it up.
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
The keys of inventory are period and articleID. The size varies with the parameters but is always greater than 3 GB, at about 62,670,000 rows and 7 columns.
My thoughts on each part so far:
1. Subset: period > p
This might be faster with a binary search on the key instead of the vector scan, but I would need to generate the sequence from p to max(period) for that, which takes additional time. Besides, the data.table is already sorted by period, so I suppose the gain in speed is not high (see the first sketch after this list).
2. Aggregate: sum(incoming)
No idea how to improve this.
3. Group: by = articleID
This grouping might be faster with a different key on the table, but that would have a bad impact on the rest of my code (see the second sketch after this list).
4. Access: [, V1]
This could be deferred to later operations, but I doubt there is much speed to gain.
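For item 1, the binary search alternative I have in mind would look roughly like the following sketch. It assumes the periods are consecutive, which is why the join values have to be generated first:

periods_after_p <- seq(p + 1, inventory[, max(period)])
# join on the first key column (period) via binary search
# instead of vector-scanning period > p; nomatch = 0L drops
# periods that do not occur in the table
inco_bs <- inventory[.(periods_after_p), sum(incoming), by = articleID, nomatch = 0L][, V1]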
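And for item 3, the different key order would be something like this sketch; the two setkey calls are themselves expensive re-sorts, which is exactly the bad impact on the rest of my code mentioned above:

setkey(inventory, articleID, period)   # pre-sorts the groups
inco_alt <- inventory[period > p, sum(incoming), by = articleID][, V1]
setkey(inventory, period, articleID)   # restore the key my other code needs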
Do you have ideas for detailed profiling or improving this operation?
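The only profiling route I can think of myself is base R's Rprof (a minimal sketch; the output file name inco.out is arbitrary):

Rprof("inco.out")
inco <- inventory[period > p, sum(incoming), by = articleID][, V1]
Rprof(NULL)
summaryRprof("inco.out")  # time spent per function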
Minimal reproducible example (decrease n to make it run on your machine, if necessary):
library(data.table)
p <- 100
n <- 10000
inventory <- CJ(period = seq(1, n, 1),
                weight = c(0.1, 1),
                volume = c(1, 10),
                price = c(1, 1000),
                E_demand = c(1000),
                VK = seq(from = 0.2, to = 0.8, by = 0.2),
                s = c(seq(1, 99, 1), seq(from = 100, to = 1000, by = 20)))
inventory[, articleID := paste0("W", weight, "V", volume, "P", price,
                                "E", round(E_demand, 2), "VK", round(VK, 3), "s", s)]
inventory[, incoming := rgamma(nrow(inventory), shape = 0.3, rate = 1)]
setkey(inventory, period, articleID)
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
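One check I can think of (an assumption on my side, not something I have verified helps here): data.table's verbose output reports whether the grouped sum is handled by its internal GForce optimization, and system.time gives a rough overall timing:

# does GForce optimize sum(incoming) by group? verbose = TRUE reports it
system.time(
  inco <- inventory[period > p, sum(incoming), by = articleID, verbose = TRUE][, V1]
)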