
I have the following string:

x <- "(((K05708+K05709+K05710+K00529) K05711),K05712) K05713 K05714 K02554"
# [1] "(((K05708+K05709+K05710+K00529) K05711),K05712) K05713 K05714 K02554"

and I want to split it on spaces while ignoring any spaces inside the parentheses, to get something like:

[[1]]
[1] "(((K05708+K05709+K05710+K00529) K05711),K05712)"                
[2] "K05713"                          "K05714"                         
[4] "K02554"

Note that the space inside the outermost parentheses is preserved.

I read the following answers but couldn't make them work in my case: "r split on delimiter not in parentheses" and "Using strsplit() in R, ignoring anything in parentheses".

Thanks in advance!

IgnacioF
  • Looks like your string has nested balanced `()`, and you need to skip those spaces inside *balanced* parentheses, right? – Wiktor Stribiżew Sep 27 '16 at 20:41
  • Yes! You are correct. – IgnacioF Sep 27 '16 at 20:43
  • Does the last parenthesis on each line always mark the end of the first field? Are the number of fields known (here 4)? – G. Grothendieck Sep 27 '16 at 22:55
  • Regarding your 2nd question: no they aren't. It's just an example of many possibilities. I'm not following you on the first one, are you asking if all the cases I have follow the same pattern of being the first field the nested one? – IgnacioF Sep 28 '16 at 11:49

1 Answer


I think you need a PCRE-based regex that matches the balanced parentheses and skips them, and then matches the whitespace that remains:

(\((?:[^()]++|(?1))*\))(*SKIP)(*F)|\s

See the regex demo (the space is replaced with \s above for better visibility).

Pattern details:

  • (\((?:[^()]++|(?1))*\))(*SKIP)(*F) - Group 1 matches a balanced parenthesized substring, which is then discarded:
    • \((?:[^()]++|(?1))*\) - a balanced parenthesized substring: \( matches a literal (, (?:[^()]++|(?1))* matches zero or more (*) sequences of either 1+ chars other than ( and ) (see [^()]++) or the whole Group 1 pattern again (see the subroutine call (?1)), and \) matches a literal )
    • (*SKIP)(*F) - makes the regex engine discard the whole matched text while keeping the regex index at the end of that match, and proceed to look for the next match
  • | - or
  • \s - a whitespace character to split on (written as a literal space in the R code)
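To see what the recursive part matches on its own, here is a minimal sketch that extracts the balanced groups instead of skipping them. The variable names and the second test string `s2` are made up for illustration; only the Group 1 pattern comes from the answer.

```r
# Group 1 alone: a balanced parenthesized substring, with (?1) recursing into itself.
balanced <- "(\\((?:[^()]++|(?1))*\\))"

s  <- "(((K05708+K05709+K05710+K00529) K05711),K05712) K05713 K05714 K02554"
s2 <- "a (b c) d (e (f g)) h"  # hypothetical string with two separate groups

# perl = TRUE is required: recursion and possessive quantifiers are PCRE features.
m1 <- regmatches(s,  gregexpr(balanced, s,  perl = TRUE))[[1]]
m2 <- regmatches(s2, gregexpr(balanced, s2, perl = TRUE))[[1]]

print(m1)
# [1] "(((K05708+K05709+K05710+K00529) K05711),K05712)"
print(m2)
# [1] "(b c)"     "(e (f g))"
```

These are exactly the substrings that (*SKIP)(*F) throws away, so the spaces inside them are never offered to the second alternative as split points.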

Here is an online R demo:

s <- "(((K05708+K05709+K05710+K00529) K05711),K05712) K05713 K05714 K02554"
strsplit(s, "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)| ", perl=TRUE)

Output:

[[1]]
[1] "(((K05708+K05709+K05710+K00529) K05711),K05712)"
[2] "K05713"                                         
[3] "K05714"                                         
[4] "K02554"
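
As a cross-check on what the regex is doing, the same split can be sketched without a regex by tracking nesting depth. This is not part of the answer: `split_top_level` is a hypothetical helper, it assumes the parentheses are balanced, and unlike strsplit() it drops empty pieces produced by consecutive spaces.

```r
# Hypothetical helper: split on spaces that sit outside all parentheses.
split_top_level <- function(s) {
  chars <- strsplit(s, "")[[1]]
  # Nesting depth after each character: +1 for "(", -1 for ")".
  depth <- cumsum((chars == "(") - (chars == ")"))
  # A space is a split point only at depth 0.
  is_split <- chars == " " & depth == 0
  # Every split point starts a new group; drop the spaces and glue each group.
  group <- cumsum(is_split)
  pieces <- tapply(chars[!is_split], group[!is_split], paste, collapse = "")
  unname(as.vector(pieces))
}

s <- "(((K05708+K05709+K05710+K00529) K05711),K05712) K05713 K05714 K02554"
res <- split_top_level(s)
print(res)
```

For the example string this should give the same four pieces as the strsplit() call, which makes it a handy sanity check when adapting the pattern to other inputs.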
Wiktor Stribiżew
  • Thanks! It seems to work fine. Could you explain a little the regex you used? – IgnacioF Sep 27 '16 at 20:48
  • Please check my answer. If the explanation is not enough, see also [Regex Recursion](http://www.regular-expressions.info/recurse.html) and [Subroutines](http://www.regular-expressions.info/subroutine.html). Also, see [How do (*SKIP) or (*F) work on regex?](http://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex). – Wiktor Stribiżew Sep 27 '16 at 20:55