
What is the preferred way to do this?

Using [ and a named key vector to recode another vector is, I thought until recently, a robust and preferred "R" idiom for performing a common task. Is there a better way I should be doing this?

Details about the task: I have a character vector of length approx 1e6, each element a single-character string. I want to convert this vector to numeric, such that the characters "B", "H", "K", "M", which are abbreviations for orders of magnitude (H = hundred, M = million, etc.), become the corresponding numbers (H = 100, M = 1e6, etc.). Any other characters not in that set of four, or NAs, are to become 1.

After much trial and error I've tracked it down to the fact that NAs in the subsetting vector substantially slow down the operation. I find this inherently confusing, because it seems to me that subsetting with NA should, if anything, be faster: it doesn't even need to search through the subsetted vector, it only needs to return an NA.
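For context, the behavior itself is as described: subsetting a named vector by name yields NA for an NA index, the same as for a name that isn't present. A minimal sketch:

```r
# Subsetting a named vector by name: an NA index yields NA,
# as does a name absent from the vector.
key <- c(H = 100, K = 1000)
key[NA_character_]    # NA
key[c("H", NA, "Z")]  # 100, NA, NA
```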

y <-  c("B", "H", "K", "M")
without_NA <- sample(rep_len(y, 1e6))
with_NA <- sample(rep_len(c(y, NA), 1e6))

convert_exponent_char_to_numeric <- function(exponent) {
  exponent_key <- 10^c(2, 3*1:3)
  names(exponent_key) <- c("H", "K", "M", "B")

  out <- exponent_key[exponent]
  out[is.na(out)] <- 1
  out
}

system.time(convert_exponent_char_to_numeric(without_NA))
   user  system elapsed 
  0.136   0.011   0.147 
system.time(convert_exponent_char_to_numeric(with_NA))
   user  system elapsed 
303.342   0.691 304.237 
t-kalinowski
  • Using a `data.frame` lookup table will speed things up too, as per: http://stackoverflow.com/a/18457055/496803 – thelatemail Sep 11 '16 at 22:34
  • 2
    Use `match`: `out <- exponent_key[match(exponent,c("H", "K", "M", "B") )]`. – nicola Sep 11 '16 at 22:42
  • @nicola - True - I suspect the speed of the lookup table from the answer I linked is due to using `match`. – thelatemail Sep 11 '16 at 22:48
  • Thank you both. `match()` did the trick, and saves me a step by having a `nomatch` argument. I ended up using a data.frame to keep things organized, because using `names` didn't make sense anymore with this new approach. I still don't understand why `[` is so slow with `NA`s in the vector (to the point of being unusable). This seems like a bug. – t-kalinowski Sep 13 '16 at 14:20
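Putting the comments together, a `match()`-based version might look like the sketch below (`convert_exponent_char_to_numeric2` is just an illustrative name). `match()`'s `nomatch` argument sends both NAs and unrecognized characters to a default slot, so `[` never has to handle an NA index:

```r
convert_exponent_char_to_numeric2 <- function(exponent) {
  # Slots 1-4 hold the magnitudes for H, K, M, B; slot 5 is the default of 1.
  key  <- c(10^c(2, 3 * 1:3), 1)
  abbr <- c("H", "K", "M", "B")
  # nomatch = 5L routes NAs and unknown characters to the default slot.
  key[match(exponent, abbr, nomatch = 5L)]
}

convert_exponent_char_to_numeric2(c("H", "M", NA, "x"))  # 100, 1e6, 1, 1
```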

1 Answer


Here's a workaround that sidesteps the slow NA-handling code path in `[`: replace the NAs with a sentinel string before the lookup, and add that sentinel to the key with a value of 1:

y          <-  c("B", "H", "K", "M")
without_NA <- sample(rep_len(y, 1e6))
with_NA    <- sample(rep_len(c(y, NA), 1e6))
with_NA[is.na(with_NA)] <- "NA"

convert_exponent_char_to_numeric <- function(exponent) {
  exponent_key <- 10^c(2, 3*1:3)
  exponent_key <- c(exponent_key, 1)
  names(exponent_key) <- c("H", "K", "M", "B", "NA")

  out <- exponent_key[exponent]
  out
}

system.time(convert_exponent_char_to_numeric(without_NA))
   user  system elapsed 
   0.03    0.01    0.04
system.time(convert_exponent_char_to_numeric(with_NA))
   user  system elapsed 
   0.04    0.01    0.05

Now they are both well under a second. The hundredth of a second of extra time taken by the with_NA version is just because there are 5 names to match on instead of 4.
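The sentinel trick in miniature (assuming the data contain no literal "NA" strings, which would collide with the sentinel):

```r
# The lookup key gains a "NA" sentinel entry mapping to the default of 1.
exponent_key <- c(H = 100, K = 1000, M = 1e6, B = 1e9, "NA" = 1)
x <- c("H", "M", NA, "B")
x[is.na(x)] <- "NA"       # replace real NAs with the sentinel string
unname(exponent_key[x])   # 100, 1e6, 1, 1e9
```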

Hack-R