I'd like to split an arbitrary string such as

x <- "(((K05708+K05709+K05710+K00529) K05711),K05712),K05713 K05714 K02554"
# [1] "(((K05708+K05709+K05710+K00529) K05711),K05712),K05713 K05714 K02554"

at the delimiters (here a space and a comma), except where they occur within parentheses, while keeping each delimiter as part of the output:

[[1]]
[1] "(((K05708+K05709+K05710+K00529) K05711),K05712)"
[2] ",K05713"                          " K05714"
[4] " K02554"

This example is adapted almost directly from the post "Split string by space except what's inside parentheses" by IgnacioF (https://stackoverflow.com/users/5935889/ignaciof). My case is a mere extension of it, and in knowing hands the solution could be as well.

With a single delimiter, I could paste it back onto the output vector elements after splitting. With several delimiters at once, though, their identities are lost at splitting, so AFAIK this would not work.
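For illustration, here is a minimal sketch of that single-delimiter workaround and of where it breaks down. The inputs `y` and `z` are made up for this example, and the parentheses requirement is ignored here.

```r
# Single delimiter: the space is known, so it can be pasted back on.
y <- "K05713 K05714 K02554"          # hypothetical input
parts <- strsplit(y, " ", fixed = TRUE)[[1]]
single <- c(parts[1], paste0(" ", parts[-1]))
print(single)
# [1] "K05713"  " K05714" " K02554"

# Two delimiters: after splitting, nothing records whether a piece
# was preceded by a space or by a comma.
z <- "K05712,K05713 K05714"          # hypothetical input
multi <- strsplit(z, "[ ,]")[[1]]
print(multi)
# [1] "K05712" "K05713" "K05714"
```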

I have tried to find a solution that keeps the delimiters, using lookahead and other modifications of the solution from the original post, but in vain, mostly because I do not understand how that solution works.

yuppe

1 Answer

You can use

x <- "(((K05708+K05709+K05710+K00529) K05711),K05712),K05713 K05714 K02554"
rx <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)|(?<=[^\\s,])(?=[\\s,])"
strsplit(x, rx, perl=TRUE)
# => [[1]]
# => [1] "(((K05708+K05709+K05710+K00529) K05711),K05712)" ",K05713" 
# => [3] " K05714"                                         " K02554"           

The pattern here is (\((?:[^()]++|(?1))*\))(*SKIP)(*F)|(?<=[^\s,])(?=[\s,]), see its demo online.

Details:

  • (\((?:[^()]++|(?1))*\))(*SKIP)(*F) - Group 1 matches a substring with balanced parentheses: \( matches a (, (?:[^()]++|(?1))* matches zero or more (*) occurrences of either one or more chars other than ( and ) (see [^()]++) or the whole Group 1 pattern again (see the subroutine call (?1)), and \) matches a literal ). Then (*SKIP)(*F) makes the regex discard the whole matched text while keeping the regex index at the end of that match, and proceed to look for the next match
  • | - or
  • (?<=[^\s,])(?=[\s,]) - a position between a character that is neither whitespace nor a comma and a following whitespace or comma char.
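
To see what each alternative contributes, the two parts can be tried separately on a shorter string. This is only a sketch: the input `y` and the variable names `boundary`, `balanced`, `naive` and `kept` are made up for illustration.

```r
y <- "(a b),c d"                     # hypothetical short input

# Second alternative only: split at every boundary before a space or
# comma, including the one inside the parentheses.
boundary <- "(?<=[^\\s,])(?=[\\s,])"
naive <- strsplit(y, boundary, perl = TRUE)[[1]]
print(naive)
# [1] "(a"  " b)" ",c"  " d"

# Full pattern: balanced parenthesised groups are matched and skipped
# first, so no split position is ever found inside them.
balanced <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)"
kept <- strsplit(y, paste0(balanced, "|", boundary), perl = TRUE)[[1]]
print(kept)
# [1] "(a b)" ",c"    " d"
```

The only difference between the two results is the split inside "(a b)", which is the part that the (*SKIP)(*F) alternative suppresses.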
Wiktor Stribiżew