
I'm trying to match a partial pattern of the variable names in my data set and replace them all with another pattern using gsubfn().

I'm using R version 4.0.3 (2020-10-10).

The code below shows a sample of the variable names in the data set and how I tried to replace them:

library(gsubfn)
replace_str = c("Race..American.India", "Race.White")
gsubd_str = gsubfn(pattern = "Race..| Race.", "R_", x = replace_str)

When I used the pattern string as above, my output is:

> gsubd_str
[1] "R_American.India" "R_hite"

However, if I swap the order of the alternatives in the pattern:

gsubd_str = gsubfn(pattern = "Race.| Race..", "R_", x = replace_str)

then my output is:

> gsubd_str
[1] "R_.American.India" "R_White"

In both cases, gsubfn() does not seem to behave as I expected. In the second case, at least, gsubfn() replaced the match as soon as the LHS of "|" matched. In the first case, however, gsubfn() also consumed the "W" in "Race.White", so one character more than "Race." was replaced.

I am not sure whether I have understood gsubfn() correctly.

Usha Kota

1 Answer


The cause is the space you added after the "|" in the pattern. gsubfn behaves exactly like gsub here, as the documentation states:

# with the space
x <-  c("Race..American.India", "Race.White")
gsub("Race..| Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsub("Race.| Race..", "R_", x)
#R> [1] "R_.American.India" "R_White" 

# without the space
gsub("Race..|Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsub("Race.|Race..", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsubfn("Race..|Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsubfn("Race.|Race..", "R_", x)
#R> [1] "R_American.India" "R_hite"  
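
To see which part of each string is actually consumed, it can help to extract the matches instead of replacing them. The following is only a minimal sketch in base R using regmatches() and regexpr(); the names m_first and m_second are made up for this illustration:

```r
x <- c("Race..American.India", "Race.White")

# The alternative with the leading space can never match: x contains no space.
grepl(" Race.", x)
#R> [1] FALSE FALSE

# Only the other alternative is in play, and an unescaped "." matches ANY character.
m_first  <- regmatches(x, regexpr("Race..| Race.", x))  # "Race" + any 2 characters
m_second <- regmatches(x, regexpr("Race.| Race..", x))  # "Race" + any 1 character

m_first
#R> [1] "Race.." "Race.W"
m_second
#R> [1] "Race." "Race."
```

This shows that the "W" is swallowed by the second unescaped dot, not by anything specific to gsubfn.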

Though, you can just do:

gsub("Race..?", "R_", x)
#R> [1] "R_American.India" "R_hite"

You might also want to escape the dots as \\. so that they match only a literal ".". Otherwise, you may end up with strange results like:

gsub("Race..?", "R_", c("Racehorses", "Racecourse", "Racerunner"))
#R> [1] "R_rses" "R_urse" "R_nner"
gsub("Race\\.\\.?", "R_", c("Racehorses", "Racecourse", "Racerunner"))
#R> [1] "Racehorses" "Racecourse" "Racerunner"

# still works
gsub("Race\\.\\.?", "R_", x)
#R> [1] "R_American.India" "R_White"
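
Since the question is about variable names in a data set, the escaped pattern can be applied to names() directly. This is a rough sketch only: the data frame df and its columns are hypothetical, and "\\.+" (one or more literal dots) is used here as a variation on the pattern above.

```r
# Hypothetical data frame; the column names mimic those in the question.
df <- data.frame(Race..American.India = c(0, 1),
                 Race.White = c(1, 0),
                 Age = c(34, 51))

# "^" anchors at the start of the name; "\\.+" matches one or more literal dots.
names(df) <- gsub("^Race\\.+", "R_", names(df))
names(df)
#R> [1] "R_American.India" "R_White"          "Age"
```

Anchoring with "^" is a design choice that keeps names merely containing "Race." somewhere in the middle untouched.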

Original answer

In both the cases, my thoughts are that gsubfn() is not behaving as expected. ...

Yes, this seems like an issue with gsubfn. It works with gsub as shown below. A workaround is to change the regular expression to "Race..?":

# works fine w/ gsub
x <-  c("Race..American.India", "Race.White")
gsub("Race..| Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsub("Race.|Race..", "R_", x)
#R> [1] "R_American.India" "R_hite" 

# does not work with gsubfn
library(gsubfn)
gsubfn("Race..| Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"
gsubfn("Race.| Race..", "R_", x)
#R> [1] "R_.American.India" "R_White" 

# you can do
gsubfn("Race..?", "R_", x)
#R> [1] "R_American.India" "R_hite" 

It is clearly stated in the manual page of gsubfn that:

If replacement is a string then it acts like gsub.

Thus, this must be a bug or maybe this is the catch from the documentation:

Note that if the "R" engine is used and if backref is non-negative then internally the pattern will be parenthesized.

  • I would not consider this a bug. You should probably consider the engine being used. By default R uses a POSIX engine, and PCRE is used when perl = TRUE. Note that different engines behave differently. You need a robust expression in order to get correct results across multiple engines – Onyambu Nov 18 '20 at 09:04
  • If the documentation states that _if replacement is a string then it acts like gsub_, then it seems like a bug to me if it does not act like `gsub`? It seems, though, that I did not notice the space. See the updated answer. – Benjamin Christoffersen Nov 18 '20 at 09:55
  • Thank you @Benjamin Christoffersen, Onyambu. However, even if the *space* character were the problem, gsubfn() did not replace *Race..* with *R_* – Usha Kota Nov 19 '20 at 06:08
  • There was a typo in my comment above, let me check the use cases again – Usha Kota Nov 19 '20 at 06:15
  • @Benjamin , I checked the variable in your example, the original strings in x do not contain a space character to match... – Usha Kota Nov 19 '20 at 06:22
  • @UshaKota, if I understand you correctly: no, `x` does not contain a space (as in your example), but you use `"Race..| Race."` with a space and not `"Race..|Race."` as the first argument of `gsubfn`. The former does not yield the result you want. – Benjamin Christoffersen Nov 19 '20 at 07:41
  • I repeated all the use cases from Benjamin's original answer and my output matches his, except that one use case with gsub() has been missed: reversing the order of the *pattern* with the space. Whether the space is on the LHS or RHS of | the output is the same, and I assume that *R..* should also be replaced with *R_*. So I see 2 issues here: a) stripping 3 characters when there is no space, b) not replacing all 3 characters when there is a space – Usha Kota Nov 19 '20 at 09:02
  • My understanding is that a *space next to the logical operator*, as in ```Race..| Race.```, should not affect the output; also, ```Race..|Race.``` or ```Race.|Race...``` should yield the same output. I understand that the *space* is treated as a character added to *Race.*, making it a different pattern string, but this is also incorrect, since the input string does not match that pattern string and hence there should not be any replacement. Thank you for all the detailed answers. An additional use case for R gsubfn() – Usha Kota Nov 19 '20 at 09:12
  • I am happy to help! Please, feel free to click the check marker if you find that I answered your question. – Benjamin Christoffersen Nov 20 '20 at 05:27
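
On the point about regex engines raised in the comments above, the difference can be sketched with base R's gsub() alone. This is an illustrative sketch, not taken from the gsubfn documentation; the variable names tre, pcre and same are made up here.

```r
x <- c("Race..American.India", "Race.White")

# Default engine (TRE, POSIX-style): the longest alternative wins,
# which is consistent with the gsub() output quoted in the answer.
tre <- gsub("Race.|Race..", "R_", x)

# PCRE (perl = TRUE): alternatives are tried left to right and the first
# one that matches wins, so only "Race" plus one character is consumed.
pcre <- gsub("Race.|Race..", "R_", x, perl = TRUE)

tre
pcre

# With escaped dots and no overlapping alternatives, both engines agree.
same <- identical(gsub("Race\\.\\.?", "R_", x),
                  gsub("Race\\.\\.?", "R_", x, perl = TRUE))
same
```

Writing the pattern so that it does not depend on how alternation is resolved is one way to get the "robust expression" mentioned in the first comment.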