17

I run across this often enough that I figure there has to be a good idiom for it. Suppose I have a data.frame with a bunch of attributes, including "product." I also have a key which translates products to brand + size. Product codes 1-3 are Tylenol, 4-6 are Advil, 7-9 are Bayer, 10-12 are Generic.

What's the fastest (in terms of human time) way to code this up?

I tend to use nested ifelse's if there are 3 or fewer categories, and type out the data table and merge it in if there are more than 3. Any better ideas? Stata has a recode command that is pretty nifty for this sort of thing, although I believe it promotes data-code intermixing a little too much.

dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L, 
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA, 
-20L), class = "data.frame")
divibisan
  • 11,659
  • 11
  • 40
  • 58
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235

13 Answers13

19

You could convert your variable to a factor and change its levels by levels<- function. In one command it could be like:

`levels<-`(
    factor(dat$product),
    list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
)

In steps:

brands <- factor(dat$product)
levels(brands) <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
Nick Cox
  • 35,529
  • 6
  • 31
  • 47
Marek
  • 49,472
  • 15
  • 99
  • 121
14

One could use a list as an associative array to define the brand -> product code mapping, i.e.:

brands <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)

Once you have this, you can then either invert this to create a product code -> brand list (could take a lot of memory), or just use a search function:

find.key <- function(x, li, default=NA) {
    ret <- rep.int(default, length(x))
    for (key in names(li)) {
        ret[x %in% li[[key]]] <- key
    }
    return(ret)
}

I'm sure there are better ways of writing this function (the for loop is annoying me!), but at least it is vectorised, so it only requires a single pass through the list.

Using it would be something like:

> dat$brand <- find.key(dat$product, brands)
> dat
   product   brand
1       11 Generic
2       11 Generic
3        9   Bayer
4        9   Bayer
5        6   Advil
6        1 Tylenol
7       11 Generic
8        5   Advil
9        7   Bayer
10      11 Generic
11       5   Advil
12      11 Generic
13       4   Advil
14       3 Tylenol
15      10 Generic
16       7   Bayer
17      10 Generic
18       5   Advil
19       9   Bayer
20       8   Bayer

The recode and levels<- solutions are very nice, but they are also significantly slower than this one (and once you have find.key this is easier-for-humans than recode and on par with the levels<-):

> microbenchmark(
     recode=recode(dat$product,recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'"), 
     find.key=find.key(dat$product, brands),
     levels=`levels<-`(factor(dat$product),brands))
Unit: microseconds
      expr      min        lq    median        uq      max
1 find.key   64.325   69.9815   76.8950   83.8445  221.748
2   levels  240.535  248.1470  274.7565  306.8490 1477.707
3   recode 1636.039 1683.4275 1730.8170 1855.8320 3095.938

(I can't get the switch version to benchmark properly, but it appears to be faster than all of the above, although it is even worse-for-humans than the recode solution.)

huon
  • 94,605
  • 21
  • 231
  • 225
  • Fun solution, but definitely doesn't pass the faster-for-humans muster! – Ari B. Friedman May 03 '12 at 13:21
  • Why not? `find.key` is a generic function you can just copy paste into your code and use. – huon May 03 '12 at 13:25
  • Newer version looks very easy to use. This version didn't: `cbind(dat,dat$brand brand=find<- find.key(dat$product, brands))`. But now that I actually look at it it's not complicated either. Morning stupidity :-) – Ari B. Friedman May 03 '12 at 13:34
  • Very nice response and very fast. I like it a lot. +1 – Tyler Rinker May 03 '12 at 13:41
  • @gsk3, yeah, I had trouble working out how to vectorise `find.key` and so didn't think how to do the last step very much. Evening mindblank :) – huon May 03 '12 at 13:47
  • Here's a version without a `for` loop; unsure if it would be any faster, but here it is anyway: `find.key2 <- function(x, li) { cut(match(x, unlist(li)), c(0, cumsum(sapply(li, length))), labels=names(li)) }` – Aaron left Stack Overflow Sep 05 '12 at 17:45
  • I still love this function, but I just noticed it replaces anything not in the `li` list with `NA`. – Ari B. Friedman May 10 '13 at 11:13
  • @AriB.Friedman, that's what the `default` argument controls: `find.key(c(1,13), brands, default="missing")` => `[1] "Tylenol" "missing"` – huon May 10 '13 at 13:22
  • @dbaupp I was thinking something like `default=dat$product` and it would instead insert the particular product code that didn't have a label. But now that I think about it it's probably better to return `NA` and fill in the missings with the codes after the fact. – Ari B. Friedman May 10 '13 at 14:04
13

I like the recode function in the car package:

library(car)

dat$brand <- recode(dat$product,
  recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'")

# > dat
#    product   brand
# 1       11 Generic
# 2       11 Generic
# 3        9   Bayer
# 4        9   Bayer
# 5        6   Advil
# 6        1 Tylenol
# 7       11 Generic
# 8        5   Advil
# 9        7   Bayer
# 10      11 Generic
# 11       5   Advil
# 12      11 Generic
# 13       4   Advil
# 14       3 Tylenol
# 15      10 Generic
# 16       7   Bayer
# 17      10 Generic
# 18       5   Advil
# 19       9   Bayer
# 20       8   Bayer
BenBarnes
  • 19,114
  • 6
  • 56
  • 74
  • 9
    the only problem with `recode` is that it works by processing strings, so if your codes/data happen to have semicolons and = signs in them it's a big headache ... – Ben Bolker May 03 '12 at 13:25
8

I often use the technique below:

key <- c()
key[1:3] <- "Tylenol"
key[4:6] <- "Advil"
key[7:9] <- "Bayer"
key[10:12] <- "Generic"

Then,

> key[dat$product]
 [1] "Generic" "Generic" "Bayer"   "Bayer"   "Advil"   "Tylenol" "Generic" "Advil"   "Bayer"   "Generic"
[11] "Advil"   "Generic" "Advil"   "Tylenol" "Generic" "Bayer"   "Generic" "Advil"   "Bayer"   "Bayer"  
kohske
  • 65,572
  • 8
  • 165
  • 155
7

The "database approach" is to keep a separate table (a data.frame) for your product keys definitions. It makes even more sense since you say your product keys translate into not only a brand, but also a size:

product.keys <- read.table(textConnection("

product brand   size
1       Tylenol small
2       Tylenol medium
3       Tylenol large
4       Advil   small
5       Advil   medium
6       Advil   large
7       Bayer   small
8       Bayer   medium
9       Bayer   large
10      Generic small
11      Generic medium
12      Generic large

"), header = TRUE)

Then, you can join your data using merge:

merge(dat, product.keys, by = "product")
#    product   brand   size
# 1        1 Tylenol  small
# 2        3 Tylenol  large
# 3        4   Advil  small
# 4        5   Advil medium
# 5        5   Advil medium
# 6        5   Advil medium
# 7        6   Advil  large
# 8        7   Bayer  small
# 9        7   Bayer  small
# 10       8   Bayer medium
# 11       9   Bayer  large
# 12       9   Bayer  large
# 13       9   Bayer  large
# 14      10 Generic  small
# 15      10 Generic  small
# 16      11 Generic medium
# 17      11 Generic medium
# 18      11 Generic medium
# 19      11 Generic medium
# 20      11 Generic medium

As you notice, the order of the rows is not preserved by merge. If this is a problem, the plyr package has a join function that does preserve the order:

library(plyr)
join(dat, product.keys, by = "product")
#    product   brand   size
# 1       11 Generic medium
# 2       11 Generic medium
# 3        9   Bayer  large
# 4        9   Bayer  large
# 5        6   Advil  large
# 6        1 Tylenol  small
# 7       11 Generic medium
# 8        5   Advil medium
# 9        7   Bayer  small
# 10      11 Generic medium
# 11       5   Advil medium
# 12      11 Generic medium
# 13       4   Advil  small
# 14       3 Tylenol  large
# 15      10 Generic  small
# 16       7   Bayer  small
# 17      10 Generic  small
# 18       5   Advil medium
# 19       9   Bayer  large
# 20       8   Bayer medium

Finally, if your tables are large and speed is an issue, consider using data.tables (from the data.table package) instead of data.frames.

flodel
  • 87,577
  • 21
  • 185
  • 223
6

This one takes some typing but if you really do have a huge data set this may be the way to go. Bryangoodrich and Dason at talkstats.com taught me this one. It's using a hash table or creating a environment that contains a look up table. I actually keep this one on my .Rprofile (the hash function that is) for dictionary type look ups.

I replicated your data 1000 times to make it a bit larger.

#################################################
# THE HASH FUNCTION (CREATES A ENW ENVIRONMENT) #
#################################################
hash <- function(x, type = "character") {
    e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())
    char <- function(col) assign(col[1], as.character(col[2]), envir = e)
    num <- function(col) assign(col[1], as.numeric(col[2]), envir = e)
    FUN <- if(type=="character") char else num
    apply(x, 1, FUN)
    return(e)
}
###################################
# YOUR DATA REPLICATED 1000 TIMES #
###################################
dat <- dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L, 
    7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA, 
    -20L), class = "data.frame")
dat <- dat[rep(seq_len(nrow(dat)), 1000), , drop=FALSE]
rownames(dat) <-NULL
dat
#########################
# CREATE A LOOKUP TABLE #
#########################
med.lookup <- data.frame(val=as.character(1:12), 
    med=rep(c('Tylenol', 'Advil', 'Bayer', 'Generic'), each=3))  

########################################
# USE hash TO CREATE A ENW ENVIRONMENT #
########################################  
meds <- hash(med.lookup)  

##############################
# CREATE A RECODING FUNCTION #
##############################          
recoder <- function(x){
    x <- as.character(x) #turn the numbers to character
    rc <- function(x){
       if(exists(x, env = meds))get(x, e = meds) else NA 
    }  
    sapply(x, rc, USE.NAMES = FALSE) 
}
#############
# HASH AWAY #
#############
recoder(dat[, 1])    

In this case hashing is slow but if you have more levels to recode then it will increase in speed over others.

Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
3

Somewhat more readable than nested ifelse's:

unlist(lapply(as.character(dat$product), switch,
              `1`=,`2`=,`3`='tylenol',
              `4`=,`5`=,`6`='advil',
              `7`=,`8`=,`9`='bayer',
              `10`=,`11`=,`12`='generic'))

Caveat: not very efficient.

Ernest A
  • 7,526
  • 8
  • 34
  • 40
2

I tend to use this function:

recoder <- function (x, from = c(), to = c()) {
  missing.levels <- unique(x)
  missing.levels <- missing.levels[!missing.levels %in% from]
  if (length(missing.levels) > 0) {
    from <- append(x = from, values = missing.levels)
    to <- append(x = to, values = missing.levels)
  }
  to[match(x, from)]
}

As in:

recoder(x = dat$product, from = 1:12, to = c(rep("Product1", 3), rep("Product2", 3), rep("Product3", 3), rep("Product4", 3)))
1

If you have codes in sequential groups like in the example, this may cut the mustard:

cut(dat$product,seq(0,12,by=3),labels=c("Tylenol","Advil","Bayer","Generic"))
 [1] Generic Generic Bayer   Bayer   Advil   Tylenol Generic Advil   Bayer  
[10] Generic Advil   Generic Advil   Tylenol Generic Bayer   Generic Advil  
[19] Bayer   Bayer  
Levels: Tylenol Advil Bayer Generic
James
  • 65,548
  • 14
  • 155
  • 193
0

There's also arules:discretize, but I like it less because it makes you separate the labels from the range of values:

library(arules)
discretize( dat$product, method = "fixed", categories = c( 1,3,6,9,12 ), labels = c("Tylenol","Advil","Bayer","Generic") )

[1] Generic Generic Generic Generic Bayer   Tylenol Generic Advil   Bayer   Generic Advil   Generic Advil   Advil   Generic Bayer   Generic Advil   Generic Bayer  
Levels: Tylenol Advil Bayer Generic
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
0

For completeness (and probably fastest and simplest solution) one can create and named vector and use it for lookup. Credit: http://adv-r.had.co.nz/Subsetting.html#applications

product.code <- c(1='Tylenol', 2='Tylenol', 3='Tylenon', 4='Advil', 5 ='Advil', 6='Advil', 7='Bayer', 8='Bayer', 9='Bayer', 10='Generic', 11='Generic', 12='Generic')

To get the output

$unname(product.code[dat$product])

Bench-marking for speed with the top solutions

$microbenchmark(
 named_vector = unname(product.code[dat$product]), 
 find.key = find.key(dat$product, brands),
 levels = `levels<-`(factor(dat$product),brands))
Unit: microseconds
         expr     min       lq      mean   median       uq     max neval
 named_vector  11.777  20.4810  26.12832  23.0410  28.1610 207.360   100
     find.key  34.305  55.8090  58.75804  59.1370  65.5370 130.049   100
       levels 143.361 224.7685 234.02545 247.5525 255.7445 338.944   100

This solution is very similar to @kohske's solution but would work for non-numerical lookup.

Joe
  • 8,073
  • 1
  • 52
  • 58
imsc
  • 7,492
  • 7
  • 47
  • 69
  • You need to call `unname(product.code[as.character(dat$product]))` in order for this to make sense. Then it becomes quite a general solution. – Andre Elrico Oct 30 '18 at 13:53
0

Another version, that would work in this case:

c("Tylenol","Advil","Bayer","Generic")[(dat$product %/% 3.1) + 1]
Andre Elrico
  • 10,956
  • 6
  • 50
  • 69
0

Canonical tidyverse method to update some values of a vector from a look-up table

Based on the code from above question, the question has a good solution by look-up table and left_join in tidyverse style.

library(tidyverse)

First, we should set a look-up table. You could set the look-up table in a csv file for shortest human time.

lookup <- tribble(
  ~brand, ~code, ~size,
  "Tylenol", 1, "small",
  "Tylenol", 2, "medium",
  "Tylenol", 3, "large",
  "Advil", 4, "small",
  "Advil", 5, "medium",
  "Advil", 6, "large",
  "Bayer", 7, "small",
  "Bayer", 8, "medium",
  "Bayer", 9, "large",
  "Generic", 10, "small",
  "Generic", 11, "medium",
  "Generic", 12, "large"
)

Then, we can create new variables using the lookup table.

newdat <- dat %>% 
  left_join(lookup, by = c("product" = "code")) %>%
  select(-product)

Although this answer may be outdated, since it was the first question I encountered when searching for such answer, and it took me hours to search for an tidy answer, I'm sure my answer made some sense.

James
  • 21
  • 4