4

In R, I am trying to write code that will work on any variation of a string pattern. An example string is:

string <- "y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g"

I would like to remove ONLY the portions that contain a pattern of "(, |, )" such as:

(b | c) and (1 | f)

and be left with:

"y ~ 1 + a + (d^2) + e + g"

Please note that the characters could change (e.g., 'b' could become '1' and 'c' could become 'predictor') and I would like the code to still work. Spaces are also not required: the string could be "y~1+a+(b|c)+(d^2)+e+(1|f)+g" or any space/no-space combination thereof. The order of the terms could change as well, e.g. "y~1+a+(b|c)+e+(1|f)+(d^2)+g".

I have tried using base R string manipulation functions (gsub and sub) to search for the pattern of "(, |, )" by using variations of the pattern such as:

"\\(.*\\|.*\\)"
"\\(.*\\|"
"\\(.+\\|.+\\)"
"\\|.+\\)"

as well as many of the stringr functions to find this pattern and replace it with an empty string. However, with both base R and stringr, the replacement removes EVERYTHING between the first and last matching delimiters, for example:

gsub("\\(.*\\|.*\\)", "", string)

produces:

"y ~ 1 + a +  + g"

and

gsub("\\(.*\\|", "", string)

produces:

"y ~ 1 + a +  f) + g"

I have additionally tried using the str_locate functions but am running into issues using that efficiently since there are multiple sets of parentheses and I want the locations only of the instances with a "|" between them.

Any help is greatly appreciated.

markus

3 Answers

7

1) gsubfn. Define a function that returns an empty string or its input, depending on whether the input contains a |, and run gsubfn with it. gsubfn is like gsub except that the replacement can be a function, which takes the match as input and replaces it with the function's output.

library(gsubfn)

pick <- function(x) if (grepl("|", x, fixed = TRUE)) "" else trimws(x)
gsubfn("[+] *[(].*?[)]", pick, string, perl = TRUE)
## [1] "y ~ 1 + a  + (d^2) + e  + g"

2) Base R. Split the input into terms and use grep to keep only the ones without |. Then put what is left back together using reformulate.

s <- trimws(grep("\\|", strsplit(string, "[~+]")[[1]], invert = TRUE, value = TRUE))
reformulate(format(s[-1]), s[1])
## y ~ 1 + a + (d^2) + e + g

3) getTerms. This also uses only base R, but first parses the string into an expression representing a formula and extracts its terms with getTerms, a helper defined in the SO post "Terms of a sum in a R expression" (it is not part of base R and must be defined before running the code below). It then proceeds as in (2).

p <- parse(text = string)[[1]]
s <- grep("\\|", getTerms(p[[3]]), value = TRUE, invert = TRUE)
reformulate(s, p[[2]])
## y ~ 1 + a + (d^2) + e + g
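Since getTerms is not defined in this answer, here is a minimal self-contained sketch of approach (3). The helper get_terms_sketch is a hypothetical stand-in written for illustration, not the function from the linked post; it assumes the right-hand side is a plain sum of terms joined by binary +.

```r
# Hypothetical stand-in for getTerms (an assumption, not the linked post's code):
# recursively split a parsed sum into a list of its top-level terms.
get_terms_sketch <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(get_terms_sketch(e[[2]]), get_terms_sketch(e[[3]]))
  } else {
    list(e)
  }
}

string <- "y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g"
p <- parse(text = string)[[1]]   # the call `~`(y, <right-hand side>)

# Deparse each term back to text so it can be filtered with grepl
terms <- vapply(get_terms_sketch(p[[3]]),
                function(t) paste(deparse(t), collapse = ""), "")

# Keep only the terms that do not contain a literal "|"
keep <- terms[!grepl("|", terms, fixed = TRUE)]
keep
## [1] "1"     "a"     "(d^2)" "e"     "g"

result <- reformulate(keep, response = p[[2]])
```

Because the string is parsed rather than pattern-matched, spacing does not matter and a term such as (j + k | m) stays in one piece.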
G. Grothendieck
1

Using gsub we can achieve the desired results.

model_texts <- c("y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g",
                 "y~1+a+(b|c)+(d^2)+e+(1|f)+g",
                 "y~1+a+(b|c)+e+(1|f)+(d^2)+g")

# \\w+ (rather than \\w) so multi-character names such as 'predictor' also match
pattern <- "\\(\\w+ ?\\| ?\\w+ ?\\) ?\\+ ?"

# tests

vapply(model_texts, function(x) gsub(pattern, "", x), "")

    "y ~ 1 + a + (d^2) + e + g" 
    "y~1+a+(d^2)+e+g" 
    "y~1+a+e+(d^2)+g" 



Eyayaw
0

You could use gsub with the following regular expression to replace matches with empty strings.

"^\\([^|)]+\\|[^)]+\\) *\\+ ?|\\+? *\\([^|)]+\\|[^)]+\\)"

Start your R engine!

This regex is simple in the sense that it contains no lookarounds or other advanced regex features, so it does not require perl=TRUE. Applied with gsub, it causes the string:

(h|i) + y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g +(j+k| m)

to become (see note 1):

y ~ 1 + a  + (d^2) + e  + g

The first part of the alternation,

^\\([^|)]+\\|[^)]+\\) *\\+ ?

is needed in case (..|..) begins the string (as does (h|i) in my example), in which case it is not preceded by a plus sign.

The following link to regex101.com uses the equivalent regex for the PCRE (PHP) engine. I've included that to allow the reader to examine how each part of the regex works. (Move the cursor around to see interesting details pop up magically on the screen.)

Start your PCRE engine!

1. Notice there is an extra space after 'a' and 'e'. I've assumed that is not a problem.
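If the extra spaces do matter, a second pass can collapse them. The sketch below is one way to do that, not part of the original answer. It writes the regex without a required space before the optional plus, which is an assumption made so that the spaceless form of the string also matches; drop_bar_terms is a hypothetical helper name.

```r
# Regex written without a required space before the optional "+"
# (an assumption, so that "y~1+a+(b|c)+..." is handled too).
rx <- "^\\([^|)]+\\|[^)]+\\) *\\+ ?|\\+? *\\([^|)]+\\|[^)]+\\)"

# Hypothetical helper: drop the (..|..) terms, then tidy the whitespace.
drop_bar_terms <- function(x) {
  out <- gsub(rx, "", x)            # remove every (..|..) term
  trimws(gsub(" {2,}", " ", out))   # collapse runs of spaces, trim the ends
}

drop_bar_terms("y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g")
## [1] "y ~ 1 + a + (d^2) + e + g"
drop_bar_terms("y~1+a+(b|c)+(d^2)+e+(1|f)+g")
## [1] "y~1+a+(d^2)+e+g"
drop_bar_terms("(h|i) + y ~ 1 + a + (b | c) + (d^2) + e + (1 | f) + g +(j+k| m)")
## [1] "y ~ 1 + a + (d^2) + e + g"
```

Collapsing spaces after the fact keeps the main regex free of lookarounds, at the cost of also normalizing any intentional double spaces elsewhere in the string.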

Cary Swoveland