Which regular expression is more appropriate?

Question

I am trying to make model output prettier with pre-defined labels for my variables. I have a vector of variable names (a), a vector of labels (b) and a vector of model terms (c).

I need to match the vectors (a) and (c) and replace (a) with (b). I found this question, which introduced me to the function gsubfn from the gsubfn package. The function matches and replaces multiple strings. I followed their example, but it did not work properly in my case:

library(gsubfn)

a <- c("ecog.ps", "resid.ds", "rx")
b <- c("ECOG-PS", "Residual Disease", "Treatment")
c <- c("ecog.psII", "rxt2", "ecog.psII:rxt2")

gsubfn("\\S+", setNames(as.list(b), a), c)
[1] "ecog.psII"      "rxt2"           "ecog.psII:rxt2"

If I use a specific pattern, then it works:

gsubfn("ecog.ps", setNames(as.list(b), a), c)
[1] "ECOG-PSII"      "rxt2"           "ECOG-PSII:rxt2"

So I guess my problem is the regular expression passed as the pattern argument to gsubfn. I checked this R-pub and Hadley's book for regular expressions, and it seemed that \S+ would be adequate. I tried other regular expressions without success:

gsubfn("[:graph:]", setNames(as.list(b), a), c)
[1] "ecog.psII"      "rxt2"           "ecog.psII:rxt2"

gsubfn("[:print:]", setNames(as.list(b), a), c)
[1] "ecog.psII"      "rxt2"           "ecog.psII:rxt2"

Which pattern should be used in the function gsubfn to match the vectors (a) and (c) and replace (a) with (b)?

No, `\S+` is not a good pattern, it matches more than you need. What are the exact *pattern* requirements? As an example, try `pat <- paste(a, collapse="|")` and then `gsubfn(pat, setNames(as.list(b), a), c)`. — Wiktor Stribiżew, Jan 24 '18 at 17:32
I cannot provide a solution (no one can) until you clarify what contexts you need to find and replace in. — Wiktor Stribiżew, Jan 24 '18 at 18:52
@WiktorStribizew, it worked perfectly! Thank you. I guess my understanding of character classes was not very clear. My context is that I am trying to make models output prettier with pre-defined labels for my variables. So I am extracting variable names (a) from a dataset, labels (b) from attributes of my dataset and model terms (c) from a coxph object using `broom::tidy` Sorry, it was not clear. I will edit my question. — Márcio Augusto Diniz, Jan 24 '18 at 19:19

score 1 · Accepted Answer · answered Jan 24 '18 at 19:27

The \S+ pattern matches each term in full (ecog.psII, rxt2 and ecog.psII:rxt2), and the list has no items with those names, so nothing is replaced. You can build the pattern dynamically from the a vector and use it to find the matches you need.
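To see why the lookup finds nothing, here is a minimal base R sketch of what \S+ matches in each term. The name `model_terms` is only a stand-in for the question's vector c, renamed so it does not shadow base::c().

```r
# Vectors from the question; `model_terms` is a stand-in name for `c`.
a <- c("ecog.ps", "resid.ds", "rx")
model_terms <- c("ecog.psII", "rxt2", "ecog.psII:rxt2")

# First match of \S+ in each term. None of the terms contains whitespace,
# so the match is always the whole string.
hits <- regmatches(model_terms, regexpr("\\S+", model_terms))

identical(hits, model_terms)  # TRUE: each match is the full term
hits %in% a                   # FALSE FALSE FALSE: no such names in the lookup
```

Since none of the matched strings is a name in the list, the strings come back unchanged.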

Use

pat <- paste(a, collapse="|")
## Or, if the names can contain regex special chars, escape them first (the . in ecog.ps is one of them)
pat <- paste(gsub("([][/\\\\^$*+?.()|{}-])", "\\\\\\1", a), collapse="|")
## => ecog\.ps|resid\.ds|rx

and then use

gsubfn(pat, setNames(as.list(b), a), c)
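For reference, the same lookup-and-replace idea can be sketched in base R with gregexpr and regmatches<-. This is only an illustration of the technique under the question's data, not how gsubfn is implemented; `model_terms` and `lookup` are names chosen here for the example.

```r
# Vectors from the question; `model_terms` stands in for `c`.
a <- c("ecog.ps", "resid.ds", "rx")
b <- c("ECOG-PS", "Residual Disease", "Treatment")
model_terms <- c("ecog.psII", "rxt2", "ecog.psII:rxt2")

# Named vector: variable name -> label.
lookup <- setNames(b, a)

# Escape regex special characters in the names, then join with "|".
pat <- paste(gsub("([][/\\\\^$*+?.()|{}-])", "\\\\\\1", a), collapse = "|")

# Locate every match, then replace each matched name with its label.
m <- gregexpr(pat, model_terms)
labelled <- model_terms
regmatches(labelled, m) <- lapply(
  regmatches(labelled, m),
  function(hit) unname(lookup[hit])
)

labelled
# [1] "ECOG-PSII"             "Treatmentt2"           "ECOG-PSII:Treatmentt2"
```

Only the variable name is replaced; the factor level that follows it (II, t2) is left as it was, which is why rxt2 becomes Treatmentt2.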

If you do not escape special chars, you may overmatch (since . matches any char), match the wrong strings (if the names contain quantifiers or other regex operators), or get an error (if they contain chars like (, ), an unpaired [, etc.).
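A small base R sketch of two of these failure modes. The strings "ecogXps" and "rx(" are made up for illustration and do not come from the question's data.

```r
# Unescaped: "." matches any character, so an unrelated name also matches.
overmatch <- grepl("ecog.ps", "ecogXps")
overmatch        # TRUE

# With the dot escaped, only the literal name matches.
escaped_miss <- grepl("ecog\\.ps", "ecogXps")
escaped_hit  <- grepl("ecog\\.ps", "ecog.psII")
escaped_miss     # FALSE
escaped_hit      # TRUE

# An unpaired "(" is not a valid pattern, so the call raises an error.
res <- tryCatch(
  suppressWarnings(grepl("rx(", "rx(t2)")),
  error = function(e) "invalid pattern"
)
res              # "invalid pattern"
```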
