
I am looking for a regex (preferably in R) that replaces every occurrence of a specific character, say ;, with ;;, but only when it is not inside parentheses () in the text string.

Note: 1. There may be more than one occurrence of the character inside the parentheses as well.

2. There are no nested parentheses in the data/vector.

Example

  • text;othertext should become text;;othertext
  • but text;other(texttt;some;someother);more should become text;;other(texttt;some;someother);;more (i.e. only a ; outside () is replaced)

If anything still needs clarification, I will try to explain.

in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag", "zvc;dfasdf;asdga;asd(asd;hsfd)", "adsg;(asdg;ASF;DFG;ASDF;);sdafdf", "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")

in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"             
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"           
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"

Expected output (calculated manually)

[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag" 
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"             
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"            
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
AnilGoyal
  • 1
    Please tell if any clarification is needed – AnilGoyal Jun 08 '21 at 12:04
  • 1
    Adding some code that you've tried might help make the process you're thinking of more clear – camille Jun 08 '21 at 14:13
  • @camille, sure I will add all my trials soon :) – AnilGoyal Jun 08 '21 at 14:14
  • 1
    I also wonder whether there's a solution to this upstream—does the text have to be formatted this way to begin with? Or is it a situation you have no control over? – camille Jun 08 '21 at 14:17
  • What about nested `(`? What should the output be of, say, `"asagf;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;sdfa"`? – nicola Jul 22 '21 at 09:04
  • 1
    @nicola, there are no nested parenthesis in the data/vector. – AnilGoyal Jul 22 '21 at 09:10
  • 1
    @AnilGoyal Great, I think you should add this info to the question, since it's very relevant for building a solution (would have been much harder with nested parenthesis). – nicola Jul 22 '21 at 09:15

3 Answers

10

You can use gsub with ;(?![^(]*\\)):

gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       

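; matches a ;. (?!...) is a negative lookahead: the replacement is made only when the enclosed pattern does not match. [^(] matches any character except (, * repeats the previous item 0 to n times, and \\) means followed by ). So a ; is left alone whenever a ) comes after it before the next (.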
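One way to check which semicolons the negative lookahead lets through is to ask gregexpr for the match positions instead of replacing straight away. This is only a small sketch on the example string from the question; the variable names s, pattern, pos and res are illustrative and not part of the answer.

```r
# Example string from the question; variable names are illustrative only
s <- "text;other(texttt;some;someother);more"
pattern <- ";(?![^(]*\\))"

# Positions of the ";" characters that the pattern matches
pos <- as.integer(gregexpr(pattern, s, perl = TRUE)[[1]])
pos
#> [1]  5 34

# Only the semicolons at those positions are doubled by gsub
res <- gsub(pattern, ";;", s, perl = TRUE)
res
#> [1] "text;;other(texttt;some;someother);;more"
```

The two semicolons inside the parentheses (positions 18 and 23) are not reported, because a ) follows them before any (.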

Or

gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       

; matches a ;. (?=...) is a positive lookahead: the replacement is made only when the enclosed pattern matches. [^)] matches any character except ), * repeats the previous item 0 to n times, and ($|\\() matches either the end of the string $ or a (.

Or use gregexpr and regmatches to extract the parts between ( and ) and make the replacement only in the non-matched substrings:

x <- gregexpr("\\(.*?\\)", in_vec)  #Find the part between ( and )
mapply(function(a, b) {
  paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
}, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       

But all of them work only for simple, non-nested pairs of ( and ).
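To illustrate that limitation, here is a minimal sketch that runs the first pattern on a nested input (the string is the one raised in a comment on the question; the variable names are illustrative):

```r
# Nested input taken from a comment on the question
nested <- "asagf;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;sdfa"

# After "fafgf;" the next bracket is "(", not ")", so the lookahead
# treats that semicolon as being outside parentheses and doubles it
res_nested <- gsub(";(?![^(]*\\))", ";;", nested, perl = TRUE)
res_nested
#> [1] "asagf;;(fafgf;;(sadg);sdag;a;gddfg;fd)gsfg;;sdfa"
```

The semicolon after fafgf sits inside the outer parentheses but is doubled anyway, which is not what one would want for nested data.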

GKi
  • thanks for a solution. +1 already. The requirement is like that only (open and close combination). But isn't any regex possible for matching specific characters outside this combination? – AnilGoyal Jun 08 '21 at 13:36
  • Yes, as the numbers of characters (to be replaced) before and after `( or )` are not certain – AnilGoyal Jun 08 '21 at 14:01
  • Works like a charm. Wish I could upvote you twice. :) Accepting this. Meanwhile, can you please explain a bit about your regex? – AnilGoyal Jun 08 '21 at 15:10
  • Just out of curiosity, what if the input contains nested parenthesis? For instance, say you have `asagf;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;sdfa` and the desired output is `asagf;;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;;sdfa` (your solution doubles the first `;` inside the outer `(`). – nicola Jul 22 '21 at 09:50
  • @nicola Maybe ask a new question with this topic. When you change `x <- gregexpr("\\(.*?\\)", in_vec)` to `x <- gregexpr("(\\(([^()]|(?R))*\\))", in_vec, perl=TRUE)` it might work for nested parenthesis. – GKi Aug 16 '21 at 09:30
4

Though the problem can be tackled with regex, using a simple function might be more straightforward and easier to understand.

replace_semicolons_outside_parentheses <- function(raw_string) {
    # Replace ; with ;; outside of parentheses

    processed_string <- ""
    n_open_parentheses <- 0

    # Loops over characters in raw_string
    for (char in strsplit(raw_string, "")[[1]]) {

        # Update the net number of open parentheses
        if (char == "(") {
            n_open_parentheses <- n_open_parentheses + 1
        } else if (char == ")") {
            n_open_parentheses <- n_open_parentheses - 1
        }

        # Replace ; with ;; outside of parentheses
        if (char == ";" && n_open_parentheses == 0) {
            processed_string <- paste0(processed_string, ";;")
        } else {
            processed_string <- paste0(processed_string, char)
        }      
    }
    return(processed_string)
}

Note that the function above also works for nested parentheses: no semicolons inside nested parentheses are replaced! The desired output can be obtained in a single line:

out_vec <- lapply(in_vec, replace_semicolons_outside_parentheses)

# 1. 'abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag'
# 2. 'zvc;;dfasdf;;asdga;;asd(asd;hsfd)'
# 3. 'adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf'
# 4. 'asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa'
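The same depth-counting idea can also be written without an explicit loop. The following is only a sketch of one possible variant, not the answer's own code: the function name double_semicolons_outside is hypothetical, and it assumes the parentheses are balanced. It returns a character vector rather than a list.

```r
# Hypothetical helper (name not from the original answer).
# Assumes balanced parentheses in every input string.
double_semicolons_outside <- function(x) {
  vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    # Nesting depth after each character: +1 at "(", -1 at ")"
    depth <- cumsum((chars == "(") - (chars == ")"))
    # A ";" is outside all parentheses when the depth there is 0
    outside <- chars == ";" & depth == 0
    chars[outside] <- ";;"
    paste(chars, collapse = "")
  }, character(1))
}

res_flat <- double_semicolons_outside("text;other(texttt;some;someother);more")
res_flat
#> [1] "text;;other(texttt;some;someother);;more"

res_deep <- double_semicolons_outside("asagf;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;sdfa")
res_deep
#> [1] "asagf;;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;;sdfa"
```

Because the depth is tracked with cumsum, nested parentheses are handled the same way as in the loop version.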
rpdejonge
1

Use the following in case of no nested parentheses:

gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  \(                       '(' char
--------------------------------------------------------------------------------
    [^()]*                   any character except: '(', ')' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \)                       ')' char
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skip current match, search for new one from here
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  ;                        ';'
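As a quick check of the explanation above, the sketch below applies the non-nested pattern to the vector from the question and compares the result with the expected output listed there. The names expected and res_skip are illustrative.

```r
# Input and expected output copied from the question
in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag",
            "zvc;dfasdf;asdga;asd(asd;hsfd)",
            "adsg;(asdg;ASF;DFG;ASDF;);sdafdf",
            "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")
expected <- c("abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag",
              "zvc;;dfasdf;;asdga;;asd(asd;hsfd)",
              "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf",
              "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa")

# Each "(...)" group is matched and then discarded by (*SKIP)(*FAIL),
# so only semicolons outside the groups are left for the ";" branch
res_skip <- gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl = TRUE)
identical(res_skip, expected)
#> [1] TRUE
```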

If there are nested parentheses:

gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \(                       '('
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      [^()]++                  any character except: '(', ')' (1 or more times 
                               (matching the most amount possible, no backtracking))
--------------------------------------------------------------------------------
     |                         or
--------------------------------------------------------------------------------
     (?1)                    recursing first group pattern
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
  \)                         ')' 
--------------------------------------------------------------------------------  
  )                          end of \1
--------------------------------------------------------------------------------  
  (*SKIP)(*FAIL)             skip the match, search for next
--------------------------------------------------------------------------------
  |                         or
--------------------------------------------------------------------------------
  ;                         ';'
--------------------------------------------------------------------------------
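A short sketch of the recursive pattern on nested input follows. The first string is the one raised in a comment on the question; the second is a made-up example with deeper nesting, and the variable names are illustrative.

```r
# Recursive pattern from above: (?1) re-enters group 1 for inner "(...)"
rx <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;"

# Nested input from a comment on the question
res_a <- gsub(rx, ";;", "asagf;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;sdfa", perl = TRUE)
res_a
#> [1] "asagf;;(fafgf;(sadg);sdag;a;gddfg;fd)gsfg;;sdfa"

# Made-up example with three levels of nesting
res_b <- gsub(rx, ";;", "a;(b;(c;(d;e));f);g", perl = TRUE)
res_b
#> [1] "a;;(b;(c;(d;e));f);;g"
```

The whole balanced group is consumed in one go, so no semicolon at any depth inside it reaches the ; branch.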
Ryszard Czech