
For a long time I have used code like the following to replace values in a vector based on a match in a "lookup" table. In this example, each value in input that matches a value in the second column of key is replaced by the corresponding value in the first column of key.

key<-cbind(c("one","one","two","three","four","five"),c("one1","one11","two2","three3","over","over"))

input<-c("one1","one11","three","four","five")

input[which(!is.na(match(key[,2],input)))]<-key[!is.na(match(key[,2],input)),1]

Is there a more efficient way of accomplishing this? The merge function does not seem to work. The method here does not work when there is not a one-to-one match between key and input.

jsta

1 Answer


Your code isn't quite correct:

  • Note that match(key[, 2], input) in the index on the LHS has length 6 (the number of rows of key), not 5 (the length of input). So !is.na(...) also has length 6, and which(!is.na(...)) is an index into key, not into input.
  • You additionally lose the order of the matches by using !is.na() on the right-hand side. (It works in your example only because the matching rows of key happen to have the same indices as the elements to replace in input, and in the same order.)

As an illustrative example, let's shuffle your key

key <- key[c(3,2,4,5,6,1), ]
input[which(!is.na(match(key[,2],input)))]<-key[!is.na(match(key[,2],input)),1]
input
[1] "one1"  "one"   "three" "four"  "five"  "one"  

Note how your new input now has 6 elements, and the first one1 wasn't replaced. Have a look at match(key[,2], input), !is.na(...) and which(!is.na(...)) to see why.
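To see what goes wrong, it may help to print the intermediate values side by side. This is only a sketch: it rebuilds key and input from the question and applies the same shuffle, and the names wrong and right are mine, purely for illustration.

```r
# Rebuild the data from the question, then shuffle key as above.
key <- cbind(c("one", "one", "two", "three", "four", "five"),
             c("one1", "one11", "two2", "three3", "over", "over"))
key <- key[c(3, 2, 4, 5, 6, 1), ]
input <- c("one1", "one11", "three", "four", "five")

# One entry per row of key: the position in input, or NA if absent.
wrong <- match(key[, 2], input)
print(wrong)                 # NA  2 NA NA NA  1
print(which(!is.na(wrong)))  # 2 6 (row numbers of key, not positions in input)

# One entry per element of input: the row of key, or NA if absent.
right <- match(input, key[, 2])
print(right)                 # 6  2 NA NA NA
```

Index 6 is past the end of input, which is why the assignment appends a sixth element instead of replacing the first one.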

You need to use match(input, key[,2]) instead: element i is non-NA when input[i] has a match in key, and its value is the row index of that match in key. So now you can use !is.na() on the LHS to do the assignment, but don't use !is.na() on the right, or you lose the indices of the matches in key; use na.omit() there to keep them.

m <- match(input, key[,2]) # 6 2 NA NA NA for the shuffled `key`
input[!is.na(m)] <- key[na.omit(m), 1]

# or a one-liner
input[!is.na(match(input, key[,2]))] <- key[na.omit(match(input, key[,2])), 1]

In terms of "more efficient", I reckon this is about as good as it gets - merge calls match internally anyway, so it will almost certainly be slower. It ain't "elegant", but it's fast.

The only improvement I see is to store the match first (like I have done above, storing the match in m) to avoid calling it twice.
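If readability at the call site is the main concern, one option is to wrap the stored-match version in a small function. This is just a sketch: replace_from_key is a hypothetical name of my own (not a base R function), and it assumes key is a two-column matrix laid out as in the question.

```r
# replace_from_key() is a hypothetical helper name, not part of base R.
# x:   character vector to recode
# key: two-column matrix; column 2 holds the values to look for,
#      column 1 holds their replacements
replace_from_key <- function(x, key) {
  m <- match(x, key[, 2])    # row of key for each element of x, NA if none
  hit <- !is.na(m)
  x[hit] <- key[m[hit], 1]   # same rows as key[na.omit(m), 1]
  x
}

key <- cbind(c("one", "one", "two", "three", "four", "five"),
             c("one1", "one11", "two2", "three3", "over", "over"))
input <- c("one1", "one11", "three", "four", "five")

result <- replace_from_key(input, key)
print(result)  # "one" "one" "three" "four" "five"
```

Because match() returns only the first match, a value that appears more than once in the second column of key (like "over" here) resolves to the first matching row.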

mathematical.coffee
  • Thanks for the tip about `na.omit`. My main issue is that it is difficult when reading through code to tell at a glance what is happening on these lines. – jsta Aug 19 '15 at 11:40
  • Yes, I agree, it is hard to tell what is happening when reading back. That's why I usually go for the two-liner when doing this sort of thing, and I also tend to comment copiously when there's lots of `match(..)` going on. – mathematical.coffee Aug 20 '15 at 01:50