4

I want to know how to vectorize and memoize a custom function in R. It seems my way of thinking is not aligned with R's way of operation. So, I gladly welcome any links to good reading material. For example, R inferno is a nice resource, but it didn't help to figure out memoization in R.

More generally, can you provide a relevant usage example for the memoise or R.cache packages?

I haven't been able to find any other discussions on this subject. Searching for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching for those keywords at http://r-project.markmail.org/ does not return helpful discussions. I emailed the mailing list and did not receive a complete answer.

I am not solely interested in memoizing the GC function, and I am aware of Bioconductor and the various packages available there.

Here's my data:

seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")

Some sequences are missing, so they're blank "".

I have a function for calculating GC content:

> GC <- function(s) {
    if (!is.character(s)) return(NA)
    n <- nchar(s)
    if (n == 0) return(NA)
    m <- gregexpr('[GCSgcs]', s)[[1]]
    if (m[1] < 1) return(0)
    return(100.0 * length(m) / n)
}

It works:

> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
                  G         C       CCC         T               TTCCT           
       NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000        NA 
        C       CTC 
100.00000  66.66667

I want to memoize it. Then, I want to vectorize it.

Apparently, I must have the wrong mindset for using the memoise or R.cache R packages:

> system.time(dummy <- sapply(rep(seqs,100), GC))
   user  system elapsed
  0.044   0.000   0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
   user  system elapsed
  0.164   0.000   0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
   user  system elapsed
 10.601   0.252  10.926

Notice that the memoized functions are several orders of magnitude slower.

I tried the hash package, but things seem to be happening behind the scenes and I don't understand the output. The sequence C should have a value of 100, not NULL.

Note that using has.key(s, cache) instead of exists(s, cache) results in the same output. Also, using cache[s] <<- result instead of cache[[s]] <<- result results in the same output.

> cache <- hash()
> GCc <- function(s) {
    if (!is.character(s) || nchar(s) == 0) {
        return(NA)
    }
    if(exists(s, cache)) {
        return(cache[[s]])
    }
    result <- GC(s)
    cache[[s]] <<- result
    return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA

$G
[1] 100

$C
NULL

$CCC
[1] 100

$T
NULL

[[6]]
[1] NA

$TTCCT
[1] 40

[[8]]
[1] NA

$C
NULL

$CTC
[1] 66.66667

At least I figured out how to vectorize:

> GCv <- Vectorize(GC)
> GCv(seqs)
                  G         C       CCC         T               TTCCT           
       NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000        NA 
        C       CTC 
100.00000  66.66667 

Relevant stackoverflow posts:

Community
  • 1
  • 1
Kamil Slowikowski
  • 4,184
  • 3
  • 31
  • 39

2 Answers2

2

While this won't give you memoization across calls, you can use factors to make individual calls a lot faster if there is a fair bit of repetition. Eg using Joshua's GC2 (though I had to remove fixed=T to get it to work):

GC2 <- function(s) {
  if(!is.character(s)) stop("'s' must be character")
  n <- nchar(s)
  m <- gregexpr('[GCSgcs]', s)
  len <- sapply(m, length)
  neg <- sapply(m, "[[", 1)
  len <- len*(neg > 0)
  100.0 * len/n
}

One can easily define a wrapper like:

GC3 <- function(s) {
  x <- factor(s)
  GC2(levels(x))[x]
}

system.time(GC2(rep(seqs, 50000)))
# user  system elapsed 
# 8.97    0.00    8.99 
system.time(GC3(rep(seqs, 50000)))
# user  system elapsed 
# 0.06    0.00    0.06 
Charles
  • 4,389
  • 2
  • 16
  • 13
1

This doesn't explicitly answer your question, but this function is ~4 times faster than yours.

GC2 <- function(s) {
  if(!is.character(s)) stop("'s' must be character")
  n <- nchar(s)
  m <- gregexpr('[GCSgcs]', s)
  len <- sapply(m, length)
  neg <- sapply(m, "[[", 1)
  len <- len*(neg > 0)
  len/n
}
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
  • On my system (x86_64 linux-gnu R 2.15.0), your function is ~3 times slower. I'm testing the speed like this: `system.time(sapply(rep(seqs, 1000), GC))` gets 0.418 elapsed seconds. `system.time(sapply(rep(seqs, 1000), GC2))` gets 1.401 elapsed seconds. – Kamil Slowikowski May 01 '12 at 22:22
  • @KamilSlowikowski That's because you're using the function wrong. Joshua vectorized it, so you should have compared it to `GC2(rep(seqs,1000))`, in which case it is indeed ~4x faster. – joran May 01 '12 at 22:28
  • @KamilSlowikowski Although I think he didn't mean to set `fixed = TRUE`. – joran May 01 '12 at 22:36
  • Indeed, setting `fixed=T` results in all of the `[[1]]` elements being `-1`. Joshua, thanks for the code! I now know how to take one element out of each item in a list and that `1 * TRUE == 1` and `1 * FALSE == 0`. I feel silly now, realizing that `nchar` and `substr` work on vectors ("lists") of items, especially because the help `?nchar` says so explicitly. – Kamil Slowikowski May 01 '12 at 22:55
  • @joran: thanks for pointing that out. It was part of my testing because `?gregexpr` says using `fixed=TRUE` can be faster... but it shouldn't have made its way into my answer. – Joshua Ulrich May 01 '12 at 23:21