I have the following string:

Almonds ; Roasted Peanuts  (Peanuts; Canola Oil  (Antioxidants (319; 320)); Salt); Cashews 

I want to replace the semicolons that are not inside parentheses with commas. There can be any number of parentheses, nested to any depth, and any number of semicolons inside them. The result should look like this:

Almonds , Roasted Peanuts  (Peanuts; Canola Oil  (Antioxidants (319; 320)); Salt), Cashews 

This is my current code:

x <- "Almonds ; Roasted Peanuts  (Peanuts; Canola Oil  (Antioxidants (319; 320)); Salt); Cashews "

gsub(";(?![^(]*\\))",",",x,perl=TRUE)

[1] "Almonds , Roasted Peanuts  (Peanuts, Canola Oil  (Antioxidants (319; 320)); Salt), Cashews "

The problem is that when a () pair is nested inside a larger pair of parentheses, my regex still replaces a semicolon inside the outer parentheses with a comma (see `Peanuts, Canola Oil` in the output above).

Can I please get some help with a regex that solves this problem? Thank you in advance.

Catherine
  • However, the comma between `Almonds` and `Roasted` *is* in between parentheses. – Wiktor Stribiżew Sep 22 '21 at 23:16
  • That's fine as my final goal is to split the string by comma. I can exclude the big bracket too if it makes things easier, i.e. ```Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews ``` Hope that makes sense :) – Catherine Sep 22 '21 at 23:29
  • Another example: ```Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)``` Ideal result: ```Peanuts (32.5%), Macadamia Nuts (14%; PPPG(AHA)), Hazelnuts (9%), nuts(98%)``` – Catherine Sep 22 '21 at 23:37
  • I already have an answer how to split with a comma not inside nested parentheses, see [this thread](https://stackoverflow.com/questions/52993644/regex-split-string-on-comma-skip-anything-between-balanced-parentheses/52994460#52994460). You do not need any `gsub` here, use `strsplit` directly. – Wiktor Stribiżew Sep 23 '21 at 07:56
  • The question is about replacing, not splitting. – The fourth bird Sep 23 '21 at 08:27
  • @Thefourthbird ["my final goal is to split the string by comma"](https://stackoverflow.com/questions/69292204/regex-to-match-only-semicolons-not-in-parenthesis#comment122472154_69292204) - this is about splitting. Please re-close. – Wiktor Stribiżew Sep 23 '21 at 19:06
  • Thank you very much for your help :) I appreciate it! – Catherine Sep 25 '21 at 10:15

1 Answer

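The pattern `;(?![^(]*\))` matches a semicolon and asserts that what follows is not a run of characters other than `(` ending in a `)`. In other words, the semicolon is rejected only if a `)` can be reached to the right without crossing a `(` first.

When a nested opening parenthesis comes before the next `)`, that check finds nothing to reject, so a `;` inside the outer parentheses is still matched.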
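The difference between flat and nested parentheses can be seen on two short strings. This is a minimal sketch; the sample strings and the names `pat`, `flat` and `nested` are made up for illustration and are not part of the question.

```r
# The lookahead-based pattern from the question.
pat <- ";(?![^(]*\\))"

# Flat parentheses: the inner ";" is followed by " c)" with no "(" in
# between, so the lookahead rejects it and only the outer ";" is replaced.
flat <- "a; (b; c)"
flat_result <- gsub(pat, ",", flat, perl = TRUE)
print(flat_result)
# "a, (b; c)"

# Nested parentheses: after "b;" the scan over non-"(" characters stops at
# the "(" in front of "d" and never reaches a ")", so the ";" inside the
# outer parentheses is replaced as well.
nested <- "a; (b; c (d)); e"
nested_result <- gsub(pat, ",", nested, perl = TRUE)
print(nested_result)
# "a, (b, c (d)), e"
```

The desired result for the second string would be `a, (b; c (d)), e`, which a single lookahead cannot produce because it has no notion of nesting depth.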


You could use a recursive pattern to match the nested parentheses, which is the part you don't want to change, and then discard that match using the (*SKIP)(*F) verbs.

The semicolons that remain are outside all parentheses, so you can match them and replace them with a comma.

[^;]*(\((?>[^()]+|(?1))*\))(*SKIP)(*F)|;

In parts, the pattern matches

  • [^;]* Match 0+ times any char except ;
  • ( Capture group 1
    • \( Match the opening (
    • (?> Atomic group
      • [^()]+ Match 1+ times any char except ( and )
      • | Or
      • (?1) Recurse into the first sub pattern (group 1), which matches a nested pair of parentheses
    • )* Close the atomic group and optionally repeat
    • \) Match the closing )
  • ) Close group 1
  • (*SKIP)(*F) Discard what was matched so far and continue searching after it
  • | Or
  • ; Match a semicolon
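To see which spans the recursive group protects, group 1 can be run on its own. This is a minimal sketch; the names `balanced`, `y` and `groups` are illustrative and not part of the pattern above.

```r
# Group 1 from the pattern above, on its own: one balanced (...) block,
# including any blocks nested inside it.
balanced <- "(\\((?>[^()]+|(?1))*\\))"

y <- "Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)"

# Extract every outermost balanced block. In the full pattern these blocks
# are consumed and discarded, so the semicolons inside them are never
# offered to the second alternative.
groups <- regmatches(y, gregexpr(balanced, y, perl = TRUE))[[1]]
print(groups)
# Four matches: (32.5%)  (14%; PPPG(AHA))  (9%)  (98%)
```

Note that `(14%; PPPG(AHA))` comes back as a single match: the recursion consumes the inner `(AHA)` as part of the outer block, which is what keeps the `;` after `14%` untouched.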

See a regex demo and an R demo.

x <- c("Almonds ; Roasted Peanuts  (Peanuts; Canola Oil  (Antioxidants (319; 320)); Salt); Cashews",
"Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)")

gsub("[^;]*(\\((?>[^()]+|(?1))*\\))(*SKIP)(*F)|;",",",x,perl=TRUE)

Output

[1] "Almonds , Roasted Peanuts  (Peanuts; Canola Oil  (Antioxidants (319; 320)); Salt), Cashews"
[2] "Peanuts (32.5%), Macadamia Nuts (14%; PPPG(AHA)), Hazelnuts (9%), nuts(98%)"               
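As a cross-check that does not rely on regex recursion, the same replacement can be sketched by tracking parenthesis depth character by character. This is a different technique from the pattern above, and `replace_top_level` is a hypothetical helper name; it assumes the parentheses in the input are balanced.

```r
# Hypothetical helper: replace `from` with `to` only at parenthesis depth 0.
replace_top_level <- function(s, from = ";", to = ",") {
  chars <- strsplit(s, "")[[1]]
  # Running depth after each character: +1 for "(", -1 for ")".
  # A separator changes nothing, so its depth equals the depth around it.
  depth <- cumsum((chars == "(") - (chars == ")"))
  chars[chars == from & depth == 0] <- to
  paste(chars, collapse = "")
}

x <- c("Almonds ; Roasted Peanuts  (Peanuts; Canola Oil  (Antioxidants (319; 320)); Salt); Cashews",
"Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)")

by_depth <- vapply(x, replace_top_level, character(1), USE.NAMES = FALSE)
by_regex <- gsub("[^;]*(\\((?>[^()]+|(?1))*\\))(*SKIP)(*F)|;", ",", x, perl = TRUE)

# Both approaches should agree on well-formed input.
print(identical(by_depth, by_regex))
```

The two approaches may differ on input with unbalanced parentheses, where the depth counter and the recursive pattern treat the stray parenthesis differently.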
The fourth bird
  • Just FYI: `(?>[^()]++|(?1))*` does not have to contain both atomic group and possessive quantifier (they have the same purpose to block backtracking), either will suffice, `(?:[^()]++|(?1))*` or `(?>[^()]+|(?1))*`. – Wiktor Stribiżew Sep 23 '21 at 07:50