2

I want to be able to use grepl() and gsub() only outside of given sets of delimiters. For instance, I want to be able to ignore text between quotes.

Here is my desired output:

grepl2("banana", "'banana' banana \"banana\"", escaped =c('""', "''"))
#> [1] TRUE
grepl2("banana", "'banana' apple \"banana\"", escaped =c('""', "''"))
#> [1] FALSE
grepl2("banana", "{banana} banana {banana}", escaped = "{}")
#> [1] TRUE
grepl2("banana", "{banana} apple {banana}", escaped = "{}")
#> [1] FALSE

gsub2("banana", "potatoe", "'banana' banana \"banana\"")
#> [1] "'banana' potatoe \"banana\""
gsub2("banana", "potatoe", "'banana' apple \"banana\"")
#> [1] "'banana' apple \"banana\""
gsub2("banana", "potatoe", "{banana} banana {banana}", escaped = "{}")
#> [1] "{banana} potatoe {banana}"
gsub2("banana", "potatoe", "{banana} apple {banana}", escaped = "{}")
#> [1] "{banana} apple {banana}"

Real cases might have quoted substrings in different amounts and order.

I have written the following functions, which work for these cases, but they are clunky. gsub2() in particular is not robust: it temporarily replaces the delimited content with placeholders, and these placeholders might themselves be affected by the subsequent replacement.

regex_escape <-
function(string,n = 1) {
  for(i in seq_len(n)){
    string <- gsub("([][{}().+*^$|\\?])", "\\\\\\1", string)
  }
  string
}

grepl2 <- 
  function(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, 
           useBytes = FALSE, escaped =c('""', "''")){
    escaped <- strsplit(escaped,"")
    # TODO check that "escaped" delimiters are balanced and don't cross each other
    for(i in 1:length(escaped)){
      close <- regex_escape(escaped[[i]][[2]])
      open <- regex_escape(escaped[[i]][[1]])
      pattern_i <- sprintf("%s.*?%s", open, close)
      x <- gsub(pattern_i,"",x)
    }
    grepl(pattern, x, ignore.case, perl, fixed, useBytes)
  }

gsub2 <- function(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, 
                   fixed = FALSE, useBytes = FALSE, escaped =c('""', "''")){
  escaped <- strsplit(escaped,"")
  # TODO check that "escaped" delimiters are balanced and don't cross each other
  matches <- character()
  for(i in 1:length(escaped)){
    close <- regex_escape(escaped[[i]][[2]])
    open <- regex_escape(escaped[[i]][[1]])
    pattern_i <- sprintf("%s.*?%s", open, close)
    ind <- gregexpr(pattern_i,x)
    matches_i <- regmatches(x, ind)[[1]]
    regmatches(x, ind)[[1]] <- paste0("((",length(matches) + seq_along(matches_i),"))")
    matches <- c(matches, matches_i)
  }
  x <- gsub(pattern, replacement, x, ignore.case, perl, fixed, useBytes)
  for(i in seq_along(matches)){
    pattern <- sprintf("\\(\\(%s\\)\\)", i)
    x <- gsub(pattern, matches[[i]], x)
  }
  x
}
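
The fragility of the placeholder approach can be reproduced in a few lines. The sketch below is a simplified stand-in for what gsub2() does with a single quoted substring (it is not the full function): once the user's pattern happens to match the digit inside the `((1))` placeholder, the quoted text can no longer be restored.

```r
x <- "'a' 1"
m <- gregexpr("'.*?'", x)
quoted <- regmatches(x, m)[[1]]       # the delimited content, "'a'"
regmatches(x, m)[[1]] <- "((1))"      # x is now "((1)) 1"
x <- gsub("1", "X", x)                # the replacement also hits the placeholder
restored <- gsub("\\(\\(1\\)\\)", quoted, x)  # placeholder no longer found
restored
#> [1] "((X)) X"
```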

Is there a solution using regex and no placeholder? Note that my current function supports multiple pairs of delimiters, but I'll be satisfied with a solution that supports one pair only and does not try to match substrings between single quotes, for instance.

It is also acceptable to impose different opening and closing delimiters, for instance { and } rather than two " or two ', if it helps.

I am also fine with imposing perl = TRUE.

moodymudskipper

4 Answers

3

I tried my hand at grepl2 but haven't had a crack at (or thought of a clear solution for) gsub2 yet. Anyway, this just removes any characters (excluding new lines) between the shortest pairs of the provided escaped characters. It should scale fairly well, too. If you go with this solution, you may want to build in a check to make sure each element of escaped is a pair of characters with no spaces (or otherwise adapt the use of substr()). Hope this helps!

grepl3 <- 
  function(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, 
           useBytes = FALSE, escaped =c('""', "''")){

    new_esc1 <- gsub("([][{}().+*^$|\\?])", "\\\\\\1", substr(escaped, 1, 1))
    new_esc2 <- gsub("([][{}().+*^$|\\?])", "\\\\\\1", substr(escaped, 2, 2))
    rm_pat <- paste0(new_esc1, ".*?", new_esc2, collapse = "|")
    new_arg <- gsub(rm_pat, "", x)  # strip every delimited span from x
    grepl(pattern, new_arg, ignore.case, perl, fixed, useBytes)

  }

grepl3(pattern = "banana", x = "'banana' apple \"banana\" {banana}", escaped =c("''", '""', "{}"))
#> [1] FALSE
Andrew
3

You can use the start_escape and end_escape arguments to provide the left-hand and right-hand sides of paired delimiters such as { and }, so that they are not matched in the wrong place (for example } as the left-hand delimiter).

perl = TRUE allows look-around assertions. These assess the validity of the statements within them, without capturing them in the pattern. This post covers them pretty well.

You'll get an error with perl = FALSE, because TRE, the default regex engine for R, does not support them.
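
As a minimal sketch of the lookaround idea (the single-quote delimiter and the sample string are only illustrative), the negative lookbehind and lookahead check the neighbouring characters without making them part of the match, so nothing around the match is replaced:

```r
x <- "'banana' banana"

# (?<!') : the match must not be preceded by a single quote
# (?!')  : the match must not be followed by a single quote
out <- gsub("(?<!')banana(?!')", "potatoe", x, perl = TRUE)
out
#> [1] "'banana' potatoe"

# With the default engine the same pattern is rejected; the error is
# caught here so the script keeps running.
res <- tryCatch(
  suppressWarnings(gsub("(?<!')banana(?!')", "potatoe", x)),
  error = function(e) conditionMessage(e)
)
```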

gsub3 <- function(pattern, replacement, x, escape = NULL, start_escape = NULL, end_escape = NULL) {
  # Default to empty assertions so the function also works without delimiters.
  left_escape <- right_escape <- ""
  # Negative lookbehind: the match must not follow an opening delimiter.
  if (!is.null(escape) || !is.null(start_escape))
    left_escape <- paste0("(?<![", paste0(escape, paste0(start_escape, collapse = ""), collapse = ""), "])")
  # Negative lookahead: the match must not precede a closing delimiter.
  if (!is.null(escape) || !is.null(end_escape))
    right_escape <- paste0("(?![", paste0(escape, paste0(end_escape, collapse = ""), collapse = ""), "])")
  patt <- paste0(left_escape, "(", pattern, ")", right_escape)
  gsub(patt, replacement, x, perl = TRUE)
}
gsub3("banana", "potatoe", "'banana' banana \"banana\"", escape = "'\"")
#> [1] "'banana' potatoe \"banana\""
gsub3("banana", "potatoe", "'banana' apple \"banana\"", escape = '"\'')
#> [1] "'banana' apple \"banana\""
gsub3("banana", "potatoe", "{banana} banana {banana}", escape = "{}")
#> [1] "{banana} potatoe {banana}"
gsub3("banana", "potatoe", "{banana} apple {banana}", escape = "{}")
#> [1] "{banana} apple {banana}"

Below is grepl3 - note this doesn't need perl = TRUE since we don't care what the pattern captures, just that it matches.

grepl3 <- function(pattern, x, escape = "'", start_escape = NULL, end_escape = NULL) {
  if (!is.null(escape) || !is.null(start_escape))
    start_escape <- paste0("[^", paste0(escape, paste0(start_escape, collapse = ""), collapse = ""), "]")
  if (!is.null(escape) || !is.null(end_escape))
    end_escape <- paste0("[^", paste0(escape, paste0(end_escape, collapse = ""), collapse = ""), "]")
  patt <- paste0(start_escape, pattern, end_escape)
  grepl(patt, x)
}

grepl3("banana", "'banana' banana \"banana\"", escape =c('"', "'"))
#> [1] TRUE
grepl3("banana", "'banana' apple \"banana\"", escape =c('""', "''"))
#> [1] FALSE
grepl3("banana", "{banana} banana {banana}", escape = "{}")
#> [1] TRUE
grepl3("banana", "{banana} apple {banana}", escape = "{}")
#> [1] FALSE
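
One thing to keep in mind with negated character classes: each class has to consume an actual character, so an occurrence at the very start or end of the string is missed. The sketch below illustrates this with a single quote as the only delimiter (an illustrative choice) and shows one possible workaround, which is to accept the string boundaries as alternatives.

```r
# "[^']" needs a real character on both sides of the match.
at_start <- grepl("[^']banana[^']", "banana split")
at_start
#> [1] FALSE

# Accepting ^ and $ as alternatives covers the string boundaries.
with_anchors <- grepl("(^|[^'])banana([^']|$)", "banana split")
with_anchors
#> [1] TRUE

# Quoted occurrences are still ignored.
still_escaped <- grepl("(^|[^'])banana([^']|$)", "'banana' apple")
still_escaped
#> [1] FALSE
```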

Edit:

This should solve the gsub case without the problem mentioned by Andrew, as long as you are okay with a single pair of delimiters. I think you could modify it to allow multiple delimiters, though. Thanks for the fascinating problem; I found a new gem in regmatches()!

gsub4 <-
  function(pattern,
           replacement,
           x,
           left_escape = "{",
           right_escape = "}") {
    # `regmatches()` takes a character vector and
    # output of `gregexpr` and friends and returns
    # the matching (or unmatching, as here) substrings
    string_pieces <-
      regmatches(x,
                 gregexpr(
                   paste0(
                     "\\Q",  # Begin quote, regex will treat everything after as fixed.
                     left_escape,
                     "\\E(?>[^", # \\E ends the quoted part; (?> opens an atomic group.
                     left_escape,
                     right_escape,
                     "]|(?R))*", # Recurses, allowing nested escape characters
                     "\\Q",
                     right_escape,
                     "\\E",
                     collapse = ""
                   ),
                   x,
                   perl = TRUE
                 ), invert =NA) # even indices match pattern (so are escaped),
                                # odd indices we want to perform replacement on.
    for (k in seq_along(string_pieces)) {
      n_pieces <- length(string_pieces[[k]])
      # Due to the structure of regmatches(invert = NA), we know that it will always
      # return unmatched strings at odd values, padding with "" as needed.
      to_replace <- seq(from = 1, to = n_pieces, by = 2)
      string_pieces[[k]][to_replace] <- gsub(pattern, replacement, string_pieces[[k]][to_replace])
    }
    sapply(string_pieces, paste0, collapse = "")
  }
gsub4('banana', 'apples', "{banana's} potatoes {banana} banana", left_escape = "{", right_escape = "}")
#> [1] "{banana's} potatoes {banana} apples"
gsub4('banana', 'apples',  "banana's potatoes", left_escape = "{", right_escape = "}")
#> [1] "apples's potatoes"
gsub4('banana', 'apples', "{banana's} potatoes", left_escape = "{", right_escape = "}")
#> [1] "{banana's} potatoes"
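
To make the role of regmatches(invert = NA) more concrete, here is a small sketch of what it returns for one string. It uses a simpler, non-recursive pattern than gsub4() (so it assumes braces are not nested): the pieces alternate between text outside the braces (odd positions) and delimited text (even positions), with empty strings as padding.

```r
x <- "{banana} banana {banana}"
m <- gregexpr("\\{[^{}]*\\}", x, perl = TRUE)

# Odd positions: text outside the braces; even positions: delimited text.
pieces <- regmatches(x, m, invert = NA)[[1]]
pieces
#> [1] ""         "{banana}" " banana " "{banana}" ""

# Replace only in the odd (unescaped) pieces, then glue everything back.
odd <- seq(1, length(pieces), by = 2)
pieces[odd] <- gsub("banana", "potatoe", pieces[odd])
result <- paste0(pieces, collapse = "")
result
#> [1] "{banana} potatoe {banana}"
```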
smingerson
  • looks great! I'll tick the mark as soon as I have time to check it – moodymudskipper Nov 09 '19 at 14:42
  • Very clever! Just a heads up, for `grepl3`, this solution may run into issues if the escaped character is included between start/end escapes, e.g., if someone wrote `banana's` in the text (and `'` was one of the escaped characters). Something to keep in mind or update. – Andrew Nov 09 '19 at 16:36
  • A limitation of the gsub solution is that we can't use `"\\1"` as a replacement – moodymudskipper Jan 08 '20 at 16:05
  • You can use it but it will reset at each unescaped string piece. What is a practical example where you would need this? – smingerson Jan 09 '20 at 01:05
1

Here's a simple regex solution using the negation operator in the character class. It only satisfies your simple case. I wasn't able to make it satisfy the paired multiple delimiter request:

grepl2 <- function(patt, escape="'", arg=NULL) {
             grepl( patt=paste0("[^",escape,"]", 
                                patt,
                                "[^",escape,"]"), arg) }

grepl2("banana", "'banana' apple \"banana\"", escape =c( "'"))
#[1] TRUE

grepl2("banana", "'banana' apple ", escape =c( "'"))
#[1] FALSE
IRTFM
  • Under this, `gsub2("banana", "potatoe", "{banana} banana {banana}", escaped = "{}")` will result in `"{banana}potatoe{banana}"`. That's the problem with negating a character set. – smingerson Nov 09 '19 at 02:44
  • First, I'm not clear why that is the wrong answer. And... You really have two separate requests. You should split your question into a `grepl` version and a `gsub` version. – IRTFM Nov 09 '19 at 02:50
  • It is stripping the spaces around the interior banana, because the pattern you're matching is saying "Not the escape characters". Thus the spaces are matched and also removed. If you include the spaces in the character set, you'll get no match at all. – smingerson Nov 09 '19 at 02:54
  • Fine. Make two questions. – IRTFM Nov 09 '19 at 03:31
  • It's my question, not his :). I think these questions have too much in common to be separate. grepl is easier but a gsub solution will probably solve the grepl one as well, and grepl solutions can give clues for gsub – moodymudskipper Nov 09 '19 at 10:21
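
The space-stripping effect discussed in these comments can be reproduced directly. In the sketch below the negated classes consume the characters next to the match, so they disappear with the replacement; capturing them and writing them back with `\\1` and `\\2` is one possible workaround (lookarounds with perl = TRUE are another).

```r
x <- "{banana} banana {banana}"

# The characters matched by [^{}] are part of the match and get replaced too.
consumed <- gsub("[^{}]banana[^{}]", "potatoe", x)
consumed
#> [1] "{banana}potatoe{banana}"

# Capturing the neighbours and re-inserting them keeps the spaces.
kept <- gsub("([^{}])banana([^{}])", "\\1potatoe\\2", x)
kept
#> [1] "{banana} potatoe {banana}"
```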
1

In my opinion, you might need to specify the opening and closing delimiters separately to make the code work properly. Here I make use of regex lookarounds. This might not work universally outside R (especially the lookbehind, written (?<!...)).

grepl2 = function(pattern, x, escapes = c(open="\"'{", close="\"'}")){
     grepl(paste0("(?<![", escapes[[1]], "])",
                  pattern, 
                  "(?![", escapes[[2]], "])"), 
           x, perl = TRUE)
}
grepl2("banana", "'banana' banana \"banana\"")
#> [1] TRUE
grepl2("banana", "'banana' apple \"banana\"")
#> [1] FALSE
grepl2("banana", "{banana} banana {banana}")
#> [1] TRUE
grepl2("banana", "{banana} apple {banana}")
#> [1] FALSE
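
Note that lookarounds of this kind only inspect the characters directly next to the match, so a word in the middle of a quoted phrase is still found. The sketch below shows this, and then a different technique that is not taken from the answers above: the PCRE backtracking verbs (*SKIP)(*FAIL), which match a whole delimited span first and then discard it. It assumes perl = TRUE and a single, non-nested delimiter pair (single quotes here, chosen for illustration).

```r
x <- "'ripe banana split' apple"

# Only the immediate neighbours are checked, so this quoted banana matches.
adjacent_only <- grepl("(?<![\"'{])banana(?![\"'}])", x, perl = TRUE)
adjacent_only
#> [1] TRUE

# Consume each quoted span and throw it away, then look for the pattern.
skip_pat <- "'[^']*'(*SKIP)(*FAIL)|banana"
inside <- grepl(skip_pat, x, perl = TRUE)
inside
#> [1] FALSE
outside <- gsub(skip_pat, "potatoe", "'ripe banana split' banana", perl = TRUE)
outside
#> [1] "'ripe banana split' potatoe"
```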
T.Dong