
I have a long list of identification codes. At some point it was discovered that some, but not all, of the codes had been mixed up by mistake. The mistake was mapped out, and the correct ID codes and their incorrect partners were identified. Now everything has to be corrected.

However, the list of codes (both correct and mixed up) is very long: there are multiple entries for each ID code, and there are a lot of ID codes to correct. I have found various solutions for replacing multiple values, but they mostly seem to involve typing in the mapping instead of comparing two vectors, see: Dictionary style replace multiple items in R

That is fine if you can do a 1-to-1 mapping of everything or don't mind writing everything out, but when there are a lot of entries it stops being so great. The solution I have made is the following:

Set up data set and "translation" vectors:

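# Two-column character matrix: doubled letters ("aa", "bb", ...) and 1..26
y <- cbind(paste(letters, letters, sep=""), 1:26)
y[6,1] <- "a"
current <- c("aa", "ee", "kk", "mm")   # codes that are currently wrong
tmp <- c("11", "22", "33", "44")       # placeholder codes for the transfer stage
correct <- c("ee", "mm", "zz", "aa")   # what each entry of current should become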

replacement solution:

# Pass 1: move every wrong code to its placeholder
for (i in seq_along(current)) {
  y[,1] <- sub(current[i], tmp[i], y[,1])
}
# Pass 2: move every placeholder to the correct code
for (i in seq_along(current)) {
  y[,1] <- sub(tmp[i], correct[i], y[,1])
}

Is there a way to make this more efficient?

Thanks for the help!

Jonno Bourne
  • I'm a little confused about what your different vectors represent ... specifically `temp`. I guess I don't understand the relationship between what you have, and what you want to have (i.e., what is the relationship between `current` and `correct` – is it simply that the first element in `current` should instead be `"ee"`, as that is the first element in `correct`?). – rbatt May 09 '14 at 15:17
  • Apologies for the lack of clarity. "Current" is the list of codes which are incorrect and "Correct" is the correct values for those codes, e.g. code "ee" in the vector y should actually be "mm". The tmp vector is a transfer stage to prevent all "aa" being replaced by "ee" and then all "ee" being replaced by "mm", leaving no "ee" at all. – Jonno Bourne May 11 '14 at 16:57
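
The chaining problem described in the comment above can be reproduced on a small scale. This is only a sketch: the vector `codes` and the shortened `current`, `correct` and `tmp` vectors are hypothetical toy data, not the question's `y`.

```r
# Hypothetical toy data: one code per element
codes   <- c("aa", "ee", "mm", "zz")
current <- c("aa", "ee")   # wrong codes
correct <- c("ee", "mm")   # what they should become
tmp     <- c("11", "22")   # placeholders that cannot collide with real codes

# Naive single pass: each replacement sees the result of the previous one
naive <- codes
for (i in seq_along(current)) {
  naive <- sub(current[i], correct[i], naive, fixed = TRUE)
}
naive   # "mm" "mm" "mm" "zz": the original "aa" was changed twice

# Two passes through the placeholders keep the replacements independent
staged <- codes
for (i in seq_along(current)) staged <- sub(current[i], tmp[i], staged, fixed = TRUE)
for (i in seq_along(current)) staged <- sub(tmp[i], correct[i], staged, fixed = TRUE)
staged  # "ee" "mm" "mm" "zz"
```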

3 Answers


Here is an alternative approach using match that does all the swapping at once, so you don't need the tmp vector:

swap <- match(y[,1], current)
y[which(!is.na(swap)),1] <- correct[na.omit(swap)]

which produces the same results as your code. It appears to be more efficient by this benchmark
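
As a quick equivalence check (not a benchmark), the sketch below runs the question's two-pass loop and the match() approach on the question's example data and compares the results. The names `y_loop`, `y_match` and `hit` are introduced here for illustration only; `swap[hit]` is used in place of `na.omit(swap)` and should select the same positions.

```r
# Example data from the question
y <- cbind(paste0(letters, letters), 1:26)
y[6, 1] <- "a"
current <- c("aa", "ee", "kk", "mm")
tmp     <- c("11", "22", "33", "44")
correct <- c("ee", "mm", "zz", "aa")

# Two-pass sub() loop from the question
y_loop <- y
for (i in seq_along(current)) y_loop[, 1] <- sub(current[i], tmp[i], y_loop[, 1])
for (i in seq_along(current)) y_loop[, 1] <- sub(tmp[i], correct[i], y_loop[, 1])

# match() version: position of each code in current, NA if it is not listed
y_match <- y
swap <- match(y_match[, 1], current)
hit  <- !is.na(swap)
y_match[hit, 1] <- correct[swap[hit]]

identical(y_loop, y_match)    # TRUE
y_match[c(1, 5, 11, 13), 1]   # "ee" "mm" "zz" "aa"
```

One difference worth keeping in mind: match() compares whole strings, while sub() treats each code as a regular expression and will also replace it inside a longer string.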

MrFlick
  • Great answer, additional question though, why don't I need to specify the column of y when executing the second line of the answer? – Jonno Bourne May 12 '14 at 09:17
  • Actually, to be safer you should. I'll update my suggestion. Without the comma it would only work for the first column. The reason is that when you use a single index for a matrix, it indexes the values column by column, from 1 to 2*nrow(y). – MrFlick May 12 '14 at 12:53
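
The single-index behaviour mentioned in the last comment can be seen on a small matrix. This is a minimal sketch; `m`, `m2` and `m3` are hypothetical names used only here.

```r
# 3 x 2 character matrix, filled column by column:
# column 1 is a, b, c and column 2 is d, e, f
m <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 3)

m[2]      # "b": a single index counts down column 1 first
m[5]      # "e": element 5 in column-major order is row 2 of column 2
m[2, 2]   # "e": the same element addressed by row and column

# Assigning with a single integer index only reaches column 1
# as long as the indices do not exceed nrow(m)
m2 <- m
m2[c(1, 3)] <- "x"

# Naming the column explicitly removes the ambiguity
m3 <- m
m3[c(1, 3), 2] <- "x"
```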

One way to do this is to set the names of correct to current; then you can assign the new values easily:

names(correct) <- current
y[y[,1] %in% current,1] <- correct[y[y[,1] %in% current,1]]

Breaking this down a bit:

y[,1] %in% current is a logical vector marking which entries need to change

y[y[,1] %in% current,1] is the values to change

correct[y[y[,1] %in% current,1]] is the new values to insert, ordered by how they appear in y.
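
The same steps can be tried on a plain vector. This is a sketch under assumptions: `codes`, `lookup` and `hit` are hypothetical names, and `setNames()` is used instead of modifying `correct` in place.

```r
current <- c("aa", "ee", "kk", "mm")
correct <- c("ee", "mm", "zz", "aa")

# Named vector: names are the wrong codes, values are the right ones
lookup <- setNames(correct, current)

# Hypothetical codes, with a repeated entry and one that must not change
codes <- c("aa", "bb", "ee", "aa", "mm")

hit <- codes %in% current                  # which entries need to change
codes[hit] <- unname(lookup[codes[hit]])   # look each one up by name
codes   # "ee" "bb" "mm" "ee" "aa"
```

Because every lookup reads from the original values before anything is assigned, no intermediate placeholder codes are needed.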

Miff

Here is one approach:

library(gsubfn)
tmp2 <- as.list(correct)
names(tmp2) <- current

pat <- paste(current, collapse='|')

y[,1] <- gsubfn(pat,tmp2, y[,1])

This looks for any of the wrong codes, then looks up the matched code in the conversion list (tmp2) and replaces it with the correct value.
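
For readers without gsubfn installed, the same match-then-look-up idea can be sketched in base R with regexpr() and regmatches(). This is a different technique from the gsubfn call above, not a reimplementation of it. The names `codes`, `lookup` and `pat_whole` are hypothetical, and the pattern is anchored with ^ and $ here (an assumption that only whole codes should be replaced).

```r
current <- c("aa", "ee", "kk", "mm")
correct <- c("ee", "mm", "zz", "aa")
lookup  <- setNames(correct, current)   # wrong code -> correct code

# Hypothetical codes; "a" and "bb" must be left alone
codes <- c("aa", "bb", "ee", "a", "mm")

# Alternation of all wrong codes, anchored to match whole strings only
pat_whole <- paste0("^(", paste(current, collapse = "|"), ")$")

# Find the matches, then replace each one with its entry in lookup
m <- regexpr(pat_whole, codes)
regmatches(codes, m) <- lookup[regmatches(codes, m)]
codes   # "ee" "bb" "mm" "a" "aa"
```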

Greg Snow