
We are looking for a blazing fast solution to the following problem, in R (Rcpp is allowed).

I have a character vector:

set.seed(42)
x <- sample(LETTERS[1:4], 1e6, replace = TRUE)

And I want to convert it to a non-sequential numeric vector, where:

A = 5
B = 4
C = 3
D = 1

For example:

c("A", "B", "C", "D")

Would be:

c(5,4,3,1)

The interns and I have what we think is a ridiculously fast solution already but we think the Internet can beat us. We'll add our fastest solution as an answer after we get some responses.

Let's see!

Timings so far:

library(microbenchmark)

set.seed(42)
x <- sample(LETTERS[1:4], 1e6, replace = TRUE)

# Lookup tables below use the question's mapping (A=5, B=4, C=3, D=1).
richscriven <- function(x) {
  as.vector(c(A=5, B=4, C=3, D=1)[x])
}

richscriven_unname <- function(x) {
  unname(c(A=5, B=4, C=3, D=1)[x])
}

richscriven_op <- function(x) {
  c(5L, 4L, 3L, 1L)[c(factor(x))]
}

op_and_interns_fun <- function(x) {
  c(5,4,3,1)[as.numeric(as.factor(x))]
}

ronakshah <- function(x) {
  vec = c("A" = 5, "B" = 4, "C" = 3, "D" = 1)
  unname(vec[match(x, names(vec))])
}

microbenchmark(
  richscriven_unname(x),
  richscriven(x),
  richscriven_op(x),
  op_and_interns_fun(x),
  ronakshah(x),
  times = 15
)

Unit: milliseconds
                  expr      min       lq     mean   median       uq       max neval
 richscriven_unname(x) 36.06018 38.01026 62.80854 38.87179 41.86411 337.65773    15
        richscriven(x) 37.90615 41.61194 43.50555 44.14130 45.17277  47.47804    15
     richscriven_op(x) 31.70345 37.43262 44.10522 41.34828 45.22127  88.79605    15
 op_and_interns_fun(x) 40.18935 44.20475 49.48811 45.77867 48.15706  99.85034    15
          ronakshah(x) 29.36408 32.52615 42.40753 35.09052 38.55763  95.78571    15
Brandon Bertelsen
  • This is not a StackOverflow question. This is a CodeReview-based question. – coatless Nov 10 '17 at 04:23
  • @Jaap this isn't a strict letters-to-numbers mapping as you've marked it; it's non-sequential if you look closer. – zacdav Nov 10 '17 at 23:37
  • Please don't add comments to your question but post them as comments. That way the user also gets pinged. If it's not a duplicate, please edit the question to explain *why* the duplicate does not answer your question. – robinCTS Nov 14 '17 at 01:24
  • @BrandonBertelsen How is this not code review material? Your present solution was obfuscated by placing it not in the opening salvo but as an answer. If anything, this is a variety of code _golf_. The basis of the question is in getting the _fastest_ algorithm for recoding data. This sort of implies a _code review_. Anyhow, tomato tomatoe. – coatless Nov 14 '17 at 01:32

2 Answers


We can store the mapping in a named numeric vector:

vec <-  c("A" = 5, "B" = 4, "C" = 3, "D" = 1)

We can then write a function,

get_recoded_data <- function(num_vec, recode_data) {
   unname(recode_data[match(num_vec, names(recode_data))]) 
}

and call the function using

get_recoded_data(x, vec)
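
As a quick sanity check, here is a minimal sketch that runs the function above on the four-letter example from the question. The input `small` and the unmatched value "Z" are illustrative assumptions, not part of the original answer.

```r
# Lookup table from the answer: names are the input codes, values are the targets.
vec <- c("A" = 5, "B" = 4, "C" = 3, "D" = 1)

get_recoded_data <- function(num_vec, recode_data) {
  # match() gives, for each element, its position in names(recode_data)
  unname(recode_data[match(num_vec, names(recode_data))])
}

small <- c("A", "B", "C", "D")   # hypothetical test input
result <- get_recoded_data(small, vec)
print(result)

# Values absent from the lookup table come back as NA rather than raising an error.
with_unknown <- get_recoded_data(c("A", "Z"), vec)
print(with_unknown)
```

If silent NAs are undesirable for real data, a check such as `anyNA()` on the result is one way to catch unexpected input values.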

On my system it takes,

system.time(get_recoded_data(x, vec))
#user  system elapsed 
#0.028   0.004   0.032 

I am using macOS Sierra 10.12.6 with 16 GB RAM, an i7 processor, and RStudio 1.1.383.


Following @zacdav's comment, using the fmatch function from the fastmatch package instead of match gives a good performance improvement:

library(fastmatch)  # provides fmatch(), a faster drop-in replacement for match()

get_recoded_data <- function(num_vec, recode_data) {
  unname(recode_data[fmatch(num_vec, names(recode_data))])
}

Checking it on the same data, I get

system.time(get_recoded_data(x, vec))
#user  system elapsed 
#0.017   0.004   0.021 
Ronak Shah
  • `unname(vec[fastmatch::fmatch(x, names(vec))])` using the `fastmatch` package is a huge speed jump again. – zacdav Nov 10 '17 at 04:04
  • @zacdav Thanks, didn't know about `fastmatch` package. – Ronak Shah Nov 10 '17 at 04:11
  • If you move the closing parenthesis so that `unname` is applied before indexing, `unname(recode_data)[fmatch(num_vec, names(recode_data))]`, the timings improve almost twofold: from 26 ms to 16 ms on my system. – Gregory Demin Nov 17 '17 at 18:57
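
The reordering suggested in the last comment can be sketched as follows. Indexing a named vector also builds a names attribute as long as the result, which `unname` then throws away; stripping the names from the four-element lookup first avoids that work. This sketch uses base `match` rather than `fmatch` so it runs without the fastmatch package, and the function names `unname_after` and `unname_before` are hypothetical. It only checks that both orderings return the same values; the speed-up itself is the commenter's measurement.

```r
set.seed(42)
x <- sample(LETTERS[1:4], 1e5, replace = TRUE)  # smaller than the question's 1e6
vec <- c("A" = 5, "B" = 4, "C" = 3, "D" = 1)

# Index the named vector first, then strip the long names attribute.
unname_after <- function(x, lookup) {
  unname(lookup[match(x, names(lookup))])
}

# Strip names from the 4-element lookup first, then index.
unname_before <- function(x, lookup) {
  unname(lookup)[match(x, names(lookup))]
}

res_after <- unname_after(x, vec)
res_before <- unname_before(x, vec)
same <- identical(res_after, res_before)
print(same)
```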

Our answer relies on a somewhat uncommon method of subsetting by position:

op_and_interns_fun <- function(x) {
  c(5,4,3,1)[as.numeric(as.factor(x))]
}
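
To spell out why this works, here is a minimal sketch: `as.factor` sorts the levels of a character vector, so "A" to "D" get the integer codes 1 to 4, and those codes index into `c(5,4,3,1)`. The sketch also shows a caveat: if a letter is absent from the input, the codes shift and the result is wrong unless the levels are fixed explicitly. The inputs `x_full` and `x_partial` and the variable `lookup` are illustrative assumptions.

```r
lookup <- c(5, 4, 3, 1)   # targets for A, B, C, D, in level order

# All four letters present: levels are A, B, C, D, so the codes line up.
x_full <- c("D", "A", "C", "B")
res_full <- lookup[as.numeric(as.factor(x_full))]
print(res_full)

# "B" is missing: levels become A, C, D with codes 1, 2, 3,
# so C and D pick up the targets meant for B and C.
x_partial <- c("A", "C", "D")
res_partial <- lookup[as.numeric(as.factor(x_partial))]
print(res_partial)

# Fixing the levels explicitly keeps the codes stable.
res_safe <- lookup[as.integer(factor(x_partial, levels = c("A", "B", "C", "D")))]
print(res_safe)
```

With the question's data (one million draws from four letters) every level is almost certainly present, so the original function behaves as intended there.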
Brandon Bertelsen
  • why isn't this just posted in the question? You clearly already had this solution at time of posting. – zacdav Nov 10 '17 at 04:10
  • We didn't want to discourage people if our final solution was faster than theirs. I'm working with interns that I'd like to show a variety of different answers (we have about 6 other solutions that are all much slower and not shown here), you know, for learning purposes :) – Brandon Bertelsen Nov 10 '17 at 04:11