R speed of data.table

Question

I have a specific performance issue, that i wish to extend more generally if possible.

Context:

I've been playing around on google colab with a python code sample for a Q-Learning agent, which associate a state and an action to a value using a defaultdict:

self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))
return self._qvalues[state][action]

Not an expert but my understanding is it returns the value or add and returns 0 if the key is not found.

i'm adapting part of this in R.
the problem is I don't how many state/values combinations I have, and technically i should not know how many states I guess.
At first I went the wrong way, with the rbind of data.frames and that was very slow.
I then replaced my R object with a data.frame(state, action, value = NA_real). it works but it's still very slow. another problem is my data.frame object has the maximum size which might be problematic in the future.
then I chanded my data.frame to a data.table, which gave me worst performance, then I finally indexed it by (state, action).

qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
                        qaction = rep(seq(nbactions), times = nbstates),
                        qvalue = NA_real_,
                        stringsAsFactors =  FALSE)
setkey(qvalues, "qstate", "qaction")

Problem:

Comparing googlecolab/python vs my local R implementation, google performs 1000x10e4 access to the object in let's say 15s, while my code performs 100x100 access in 28s. I got 2s improvements by byte compiling but that's still too bad.

Using profvis, I see most of the time is spent accessing the data.table on these two calls:

qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
self$qvalues[J(state, action)]$qvalue <- value

I don't really know what google has, but my desktop is a beast. Also I saw some benchmarks stating data.table was faster than pandas, so I suppose the problem lies in my choice of container.

Questions:

is my use of a data.table wrong and can be fixed to improve and match the python implementation?
is another design possible to avoid declaring all the state/actions combinations which could be a problem if the dimensions become too large?
i've seen about the hash package, is it the way to go?

Thanks a lot for any pointer!

UPDATE:

thanks for all the input. So what I did was to replace 3 access to my data.table using your suggestions:

#self$qvalues[J(state, action)]$qvalue <- value
self$qvalues[J(state, action), qvalue := value]
#self$qvalues[J(state, action),]$qvalue <- 0
self$qvalues[J(state, action), qvalue := 0]
#qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]

this dropped the runtime from 33s to 21s that's a massive improvement, but that's still extremely slow compared to the python defaultdict implementation.

I noted the following:
working in batch: I don't think I can do as the call to the function depends on the previous call.
peudospin> I see you are surprised the get is time consuming. so am I but that's what profvis states: and here the code of the function as a reference:

QAgent$set("public", "get_qvalue", function( state, action) {
  #qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
  qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
  if (is.na(qval)) {
    #self$qvalues[self$qvalues$qstate == state & self$qvalues$qaction == action,]$qvalue <- 0
    #self$qvalues[J(state, action),]$qvalue <- 0
    self$qvalues[J(state, action), qvalue := 0]
    return(0)
  }
  return(qval)
})

At this point, if no more suggestion, I will conclude the data.table is just too slow for this kind of task, and I should look into using an env or a collections. (as suggested there: R fast single item lookup from list vs data.table vs hash )

CONCLUSION:

I replaced the data.table for a collections::dict and the bottleneck completely disappeared.

score 2 · Accepted Answer · answered Aug 18 '20 at 19:45

2

data.table is fast for doing lookups and manipulations in very large tables of data, but it's not going to be fast at adding rows one by one like python dictionaries. I'd expect it would be copying the whole table each time you add a row which is clearly not what you want.

You can either try to use environments (which are something like a hashmap), or if you really want to do this in R you may need a specialist package, here's a link to an answer with a few options.

answered Aug 18 '20 at 19:45

pseudospin

2,737
1
4
19

hum... but are you saying i'm copying the whole table when i do that? self$qvalues[J(state, action)]$qvalue <- value – Will Aug 18 '20 at 19:56
It wasn't clear what those lines were doing, I was only really responding to the general thrust of your question. That one in particular looks 'wrong' though. What is it intended to be doing? On reflection I feel it should be `self$qvalues[J(state, action), qvalue := value]` – pseudospin Aug 18 '20 at 20:10
well having a data.table with columns `(qstate, qaction, qvalue)` I want to set the qvalue cell of the row where `qstate == state and qaction == action`. I believe i could also write it this way: `mydf[mydf$qstate == state & mydf$qaction == action,]$qvalue <- value` and yes when i was doing the rbind it was bad but i thought the code above avoided the copy hence should be fast. well it's faster, but still very slow compared to python defaultdict. – Will Aug 18 '20 at 20:35
I'm honestly surprised that your version does what you say - but it does, I just checked. I also checked speed and you'll get a big boost if you switch that line to my version above which does the same thing. – pseudospin Aug 18 '20 at 20:44
all right. i will try tomorrow with your version, then with the collections package if the speed is still not up to par. thanks a lot! (FTR more than 50% of the time was spent on `self$qvalues[J(state, action), nomatch = NA_real_]$qvalue` (reading the qvalue of a given row) so I can't expect a boost of more than 50% at best) – Will Aug 18 '20 at 20:52
@Will in data.table whenever you modify a single element using `<-` apparently the entire data.table is copied (not only the modified vector, differently from data.frame objects). So always try to modify it by reference whenever is possible. Check [understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another](https://stackoverflow.com/a/10226454/11587489) and the comments below it. – daniellga Aug 18 '20 at 21:04
I don't think I can speed the other line up, but I don't really believe that is where the time is being spent though. Is there a line after that where you are adding a row when it returns NA? Because that's where the time is likely going. Good luck! – pseudospin Aug 18 '20 at 21:07
One last thing to add. If there's any way you can do the whole procedure in batches rather than looking up single pairs of (state,action) then you will be able to get a massive speed up. – pseudospin Aug 18 '20 at 21:18

score 0 · Answer 2 · answered Aug 18 '20 at 21:53

Benchmark

library(data.table)
Sys.setenv('R_MAX_VSIZE'=32000000000) # add to the ram limit
setDTthreads(threads=0) # use maximum threads possible
nbstates <- 1e3
nbactions <- 1e5

cartesian <- function(nbstates,nbactions){
    x= data.table(qstate=1:nbactions)
    y= data.table(qaction=1:nbstates)
    k = NULL
    x = x[, c(k=1, .SD)]
    setkey(x, k)
    y = y[, c(k=1, .SD)]
    setkey(y, NULL)
    x[y, allow.cartesian=TRUE][, c("k", "qvalue") := list(NULL, NA_real_)][]
}

#comparing seq with `:`
(bench = microbenchmark::microbenchmark(
    1:1e9,
    seq(1e9),
    times=1000L
    ))
#> Unit: nanoseconds
#>        expr  min   lq     mean median   uq   max neval
#>     1:1e+09  120  143  176.264  156.0  201  5097  1000
#>  seq(1e+09) 3039 3165 3333.339 3242.5 3371 21648  1000
ggplot2::autoplot(bench)


(bench = microbenchmark::microbenchmark(
    "Cartesian product" = cartesian(nbstates,nbactions),
    "data.table assignement"=qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
                        qaction = rep(seq(nbactions), times = nbstates),
                        qvalue = NA_real_,
                        stringsAsFactors =  FALSE),
    times=100L))
#> Unit: seconds
#>                    expr      min       lq     mean   median       uq      max neval
#>       Cartesian product 3.181805 3.690535 4.093756 3.992223 4.306766 7.662306  100
#>  data.table assignement 5.207858 5.554164 5.965930 5.895183 6.279175 7.670521  100
#>      data.table  (1:nb) 5.006773 5.609738 5.828659  5.80034 5.979303 6.727074  100
#>  
#>  
ggplot2::autoplot(bench)

it is clear the using seq consumes more time than calling the 1:nb. plus using a cartesian product makes the code faster even when 1:nb is used

R speed of data.table

2 Answers2

Benchmark