3

I need to write a regex in R (with perl = TRUE) that splits a string on commas but skips every comma inside round parentheses. The challenge is keeping the parentheses balanced, i.e. each closing bracket must pair with its own opening bracket.

The regex below works, except that the parentheses are not balanced: an inner closing bracket is treated as the match for an outer opening bracket.

text <- "PEANUTS (PEANUTS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT), GOLDEN RAISINS (RAISINS, SULFUR DIOXIDE), DRIED CRANBERRIES (CRANBERRIES, SUGAR, CITRIC ACID, SUNFLOWER OIL (PROCESSING AID), ELDERBERRY JUICE CONCENTRATE (COLOR)), ALMONDS (ALMONDS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT), MACADAMIAS (MACADAMIAS, MALTODEXTRIN, SALT)"

strsplit(text, '\\([^*)^)]*\\)(*SKIP)(*F)|\\,', perl=T)

With the regex above, the DRIED CRANBERRIES entry is not split correctly. Please refer to the output screenshot here: Regex Code Output

Any help would be much appreciated. Thank you!
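To make the failure easier to see, here is a minimal sketch on a short, made-up string (the input `x` and the variable name `pieces` are illustrative only, not part of my real data). The skip part stops at the first `)` it meets, so the comma that follows the inner group is treated as a separator:

```r
# Made-up input: one nested group followed by a comma inside the outer group
x <- "A (B (C), D), E"

# Same pattern as in the question
pieces <- strsplit(x, '\\([^*)^)]*\\)(*SKIP)(*F)|\\,', perl = TRUE)[[1]]

# The skipped text is "(B (C)", so the comma before " D)" still splits
print(pieces)
# [1] "A (B (C)" " D)"      " E"
```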

Rui Barradas
  • Please do not post an image of code/data/errors, just the text itself. Several reasons are immediate: I cannot copy code or data from your image into my R console and try it out, and I choose to not transcribe it manually. Some reasons are slightly less obvious but still important, including: it breaks screen readers *hard*; search engines don't read them, so searches will not find it; mobile device screen size might be a limiting factor. Ref: https://meta.stackoverflow.com/a/285557/3358272 – r2evans Oct 25 '18 at 16:16
  • Once you get into nested matching parens, I suggest you get into tokenizing it vice regular expressions. – r2evans Oct 25 '18 at 16:19
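A minimal sketch of what the tokenizing suggestion could look like, assuming "tokenizing" here means walking the characters and tracking parenthesis depth. The helper name `split_top_level` is hypothetical, and this is only one way to do it:

```r
# Hypothetical helper: split on commas that sit at parenthesis depth 0
split_top_level <- function(s) {
  chars <- strsplit(s, "")[[1]]
  # +1 for each "(", -1 for each ")"; the running sum is the nesting depth
  delta <- (chars == "(") - (chars == ")")
  depth <- cumsum(delta)
  if (any(depth < 0) || sum(delta) != 0) stop("unbalanced parentheses")
  # Commas outside every parenthesized group are the split points
  cut <- which(chars == "," & depth == 0)
  starts <- c(1, cut + 1)
  ends <- c(cut - 1, length(chars))
  trimws(substring(s, starts, ends))
}

tokens <- split_top_level("A (B, C (D, E)), F (G), H")
print(tokens)
# [1] "A (B, C (D, E))" "F (G)"           "H"
```

Unlike a regex, this version can report unbalanced input explicitly instead of silently producing odd pieces.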

2 Answers

1

You may use

strsplit(text, "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)|,", perl=TRUE)
# => [[1]]
# [1] "PEANUTS (PEANUTS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
# [2] " GOLDEN RAISINS (RAISINS, SULFUR DIOXIDE)"
# [3] " DRIED CRANBERRIES (CRANBERRIES, SUGAR, CITRIC ACID, SUNFLOWER OIL (PROCESSING AID), ELDERBERRY JUICE CONCENTRATE (COLOR))"
# [4] " ALMONDS (ALMONDS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
# [5] " MACADAMIAS (MACADAMIAS, MALTODEXTRIN, SALT)"

See the regex demo and an online R demo.

Details

  • (\\((?:[^()]++|(?1))*\\)) - capturing group #1, which matches
    • \\( - a ( char
    • (?:[^()]++|(?1))* - 0 or more repetitions of either 1+ chars other than ( and ) (matched possessively with [^()]++) or the whole group 1 pattern (recursed with (?1) to match any nested level)
    • \\) - a ) char
  • (*SKIP)(*F) - (*F) forces the match to fail at this point, and (*SKIP) stops the engine from retrying inside the text just matched, so the search resumes immediately after the parenthesized group.
  • | - or
  • , - a comma.
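As a quick check of the breakdown above, here is a small sketch on a made-up string with two nesting levels (the input `x` and the name `parts` are illustrative only). `trimws()` is added just to drop the leading spaces left after each comma:

```r
# Made-up input with two levels of nesting
x <- "A (B, C (D, E)), F (G), H"

# Same pattern as above: skip balanced (...) groups, split on the other commas
pat <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)|,"

parts <- trimws(strsplit(x, pat, perl = TRUE)[[1]])
print(parts)
# [1] "A (B, C (D, E))" "F (G)"           "H"
```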
Wiktor Stribiżew
  • Exactly what I was looking for. The explanation is helpful for me to make any modifications if necessary. Accepted your answer, Cheers! – Mohit Gulla Oct 26 '18 at 10:05
0

The edit to the accepted answer to this question seems to do the job. I just added [[:alpha:][:space:]]* at the beginning.

pat <- '[[:alpha:][:space:]]*\\(((?>[^()]+)|(?R))*\\)'
regmatches(text, gregexpr(pat, text, perl = TRUE))
#[[1]]
#[1] "PEANUTS (PEANUTS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
#[2] " GOLDEN RAISINS (RAISINS, SULFUR DIOXIDE)"
#[3] " DRIED CRANBERRIES (CRANBERRIES, SUGAR, CITRIC ACID, SUNFLOWER OIL (PROCESSING AID), ELDERBERRY JUICE CONCENTRATE (COLOR))"
#[4] " ALMONDS (ALMONDS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
#[5] " MACADAMIAS (MACADAMIAS, MALTODEXTRIN, SALT)"
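One caveat, shown as a small sketch on a made-up string (the input `x` and the name `found` are illustrative only): this pattern extracts items rather than splitting on commas, so an item with no parenthesized part would not be returned. That does not matter for the text in the question, where every ingredient has one.

```r
# Made-up input: "E" has no parenthesized part
x <- "A (B, C (D)), E, F (G)"

pat <- '[[:alpha:][:space:]]*\\(((?>[^()]+)|(?R))*\\)'
found <- trimws(regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]])

# "E" is missing because the pattern requires a "(" ... ")" group
print(found)
# [1] "A (B, C (D))" "F (G)"
```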
Rui Barradas