2

Search-and-replace an element in a data frame given a list of replacements.

Code:

library(gsubfn)
testing123tmp <- data.frame(x=c("it's", "not", "working"))
testing123tmp$x <- as.character(testing123tmp$x)
tmp <- list("it's" = "hey", "working"="dead")
apply(testing123tmp,2,function(x) gsubfn('.', tmp, x))

Expected Output:

      x        
[1,] hey   
[2,] not  
[3,] dead

My current output:

     x        
[1,] "it's"   
[2,] "not"    
[3,] "working"

I have been looking at chartr and gsub for a possible solution, but I would like something short, since this would otherwise take multiple gsub calls. Also, my variable tmp can scale to many replacement pairs, such as:

tmp <- list("it's" = "hey", 
            "working"="dead",
            "other" = "other1",
             .. = .. ,
             .. = .. ,
             .. = .. )

Edit/Update #1:

  • I would also like a solution that uses gsubfn, as above, and keeps the result in a data frame.
Julius Vainora
beavis11111

2 Answers

4

The issues are these:

  • The dot matches only one character, so it can never match an entire string unless that string is a single character; hence no name in tmp is ever matched. Use ".*" to match the entire string. To match individual words instead, e.g. when a component of x contains several whitespace-separated words such as "it's not" and we still want to replace it's, use "\\S+". Other variations are possible, and this framework accommodates many of them.

  • The third argument to gsubfn can already be a vector, and gsubfn will iterate over it, so apply is not needed. (The code still works with apply, but it is unnecessary.)

  • To keep everything in a data frame, one easy way is to use transform as shown below (or alternatively transform2, also in the gsubfn package). The x will automatically refer to the x column of the testing123tmp data frame, and transform produces a new data frame without overwriting the original. To keep the two separate, assign the result of transform to a new name; to overwrite testing123tmp, assign the result back to testing123tmp.

  • We can use stringsAsFactors = FALSE to avoid generating factor columns (this is the default from R 4.0.0 onward), which makes the as.character conversion unnecessary.

    testing123tmp <- data.frame(x=c("it's", "not", "working"), stringsAsFactors = FALSE)
    

Thus we can reduce the code to:

transform(testing123tmp, y = gsubfn(".*", tmp, x))

giving the following data.frame:

        x    y
1    it's  hey
2     not  not
3 working dead

If we wanted to overwrite the x column rather than keep separate input and output columns we could have used x = ... in the transform statement instead of y = ... .
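
The word-level variant mentioned in the first bullet can be sketched as follows. This is only an illustration: the phrases data frame and its contents are made up for the example and are not part of the original question, and the gsubfn package is assumed to be installed.

```r
library(gsubfn)  # assumption: the gsubfn package is installed

tmp <- list("it's" = "hey", "working" = "dead")

# Hypothetical multi-word input (not from the original question)
phrases <- data.frame(x = c("it's not", "still working"),
                      stringsAsFactors = FALSE)

# "\\S+" matches each run of non-whitespace characters, so every word is
# looked up in tmp on its own; words without an entry are left unchanged.
result <- transform(phrases, y = gsubfn("\\S+", tmp, x))
result
# y should contain "hey not" and "still dead"
```

With ".*" the same input would stay unchanged, because neither whole string is a name in tmp.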

G. Grothendieck
  • Besides your fantastic packages and `R` books, I even learn a lot on SO. Thanks, +1. – Jan Mar 27 '18 at 14:57
2

You may write

gsubfn(".*", tmp, testing123tmp$x)
# [1] "hey"  "not"  "dead"

and then

testing123tmp$x <- gsubfn(".*", tmp, testing123tmp$x)

As for your approach, there was no need for apply, as gsubfn is vectorized over that argument. The actual problem was that . matches only a single character, while it's and working are longer strings.

However, if you are replacing one word with another word, then there is no need for regex. For instance,

idx <- testing123tmp$x %in% names(tmp)
testing123tmp$x[idx] <- unlist(tmp)[testing123tmp$x[idx]]

should work faster. If the task is more involved, then I guess

library(stringr)
str_replace_all(testing123tmp$x, unlist(tmp))
# [1] "hey"  "not"  "dead"

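should be more robust than gsubfn, as you don't need to deal with patterns like .*. Note that str_replace_all treats the names of the vector as regular expressions and replaces them wherever they occur within a string.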
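
For the many-pair case from the question, the regex-free lookup shown earlier could be wrapped in a small helper and applied to the character columns of a data frame. This is a minimal base R sketch, not a definitive implementation: the name replace_exact is hypothetical, and it assumes whole-string matches and character (not factor) columns.

```r
# Hypothetical helper: replace whole-string matches using a named list.
# Assumes x is a character vector.
replace_exact <- function(x, map) {
  lookup <- unlist(map)          # named character vector; names are the old values
  idx <- x %in% names(lookup)    # TRUE where the whole string has a replacement
  x[idx] <- lookup[x[idx]]       # look up the replacements by name
  x
}

testing123tmp <- data.frame(x = c("it's", "not", "working"),
                            stringsAsFactors = FALSE)
tmp <- list("it's" = "hey", "working" = "dead")

# Only touch character columns so other column types are left alone.
is_chr <- vapply(testing123tmp, is.character, logical(1))
testing123tmp[is_chr] <- lapply(testing123tmp[is_chr], replace_exact, map = tmp)
testing123tmp
# x should now contain "hey", "not", "dead"
```

Adding more pairs to tmp requires no change to the helper, since the lookup is driven entirely by the names of the list.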

Julius Vainora