Consider the following code in R:

x <- "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"
strsplit(x, split = "some regex here")

I would like this to return something like a list containing the following character vector:

"A"
"B (C, D, E)"
"F"
"G [H, I, J]"
"K (L (M, N), O)"
"P (Q (R, S (T, U)))"

EDIT: The proposed alternative questions do not answer my question, since nested parentheses and brackets are allowed, and it is possible for n-level nesting to occur (beyond 2).

Wiktor Stribiżew
Clarinetist
2 Answers

This looks more like a job for a custom parser than a single regex. I would love to be proved wrong, but while we're waiting, here's a very pedestrian parsing function that gets the job done.

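parse_nested <- function(string) {

  # Work on single characters so each position can be inspected.
  chars <- strsplit(string, "")[[1]]

  # Running nesting depth for parentheses: +1 at "(", -1 at ")".
  parentheses <- numeric(length(chars))
  parentheses[chars == "("] <- 1
  parentheses[chars == ")"] <- -1
  parentheses <- cumsum(parentheses)

  # Same running depth for square brackets.
  brackets <- numeric(length(chars))
  brackets[chars == "["] <- 1
  brackets[chars == "]"] <- -1
  brackets <- cumsum(brackets)

  # Only commas outside every bracket pair are split points.
  # The sentinels 0 and length + 1 mark the start and end of the string.
  split_on <- which(brackets == 0 & parentheses == 0 & chars == ",")
  split_on <- c(0, split_on, length(chars) + 1)

  # Take the text between consecutive split points in one vectorised call.
  # Unlike indexing with a:b, substring() returns "" for an empty field.
  result <- substring(string, head(split_on, -1) + 1, tail(split_on, -1) - 1)

  trimws(result)
}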

Which produces:

parse_nested(x)
#> [1] "A"                   "B (C, D, E)"         "F"                  
#> [4] "G [H, I, J]"         "K (L (M, N), O)"     "P (Q (R, S (T, U)))"
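
To make the depth-counting idea easier to follow, here is a minimal sketch on a shorter string. The input `s` and the variable names are made up for illustration, and it uses one combined counter for both bracket types, which is a simplification of the function above (that one tracks parentheses and square brackets separately).

```r
# Hypothetical short input, chosen only for illustration.
s <- "A, B (C, D), E"
chars <- strsplit(s, "")[[1]]

# +1 for every opener, -1 for every closer, 0 for anything else.
step <- (chars %in% c("(", "[")) - (chars %in% c(")", "]"))

# Cumulative sum gives the nesting depth at each character position.
depth <- cumsum(step)

# A comma is a valid split point only where the depth is back to zero.
top_level_commas <- which(chars == "," & depth == 0)
top_level_commas
```

Here only the commas at positions 2 and 12 qualify; the comma at position 8 sits at depth 1, inside the parentheses, so it is left alone.
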
Allan Cameron

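Using regex only. stringr is built on the ICU regex engine, which does not support recursion, so we need base R with perl = TRUE (PCRE), where (?R) recurses into the whole pattern. Note that this pattern assumes each label in front of a group is a single capital letter, as in the example.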

x <- "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"

regmatches(x, 
  gregexpr("([A-Z] )*([\\(\\[](?>[^()\\[\\]]|(?R))*[\\)\\]])|[A-Z]", 
            x, perl = TRUE))

#> [[1]]
#> [1] "A"                   "B (C, D, E)"         "F"                  
#> [4] "G [H, I, J]"         "K (L (M, N), O)"     "P (Q (R, S (T, U)))"
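
If the one-line pattern is hard to read, it may help to assemble the same pattern from labelled pieces. This is only a sketch for readability: the piece names (`prefix`, `opener`, `inner`, `closer`, `single`) are invented here, and the resulting string is identical to the pattern above.

```r
x <- "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"

# Illustrative names for the pieces of the pattern used above.
prefix <- "([A-Z] )*"              # zero or more "LETTER " labels before a group
opener <- "[\\(\\[]"               # an opening ( or [
inner  <- "(?>[^()\\[\\]]|(?R))*"  # non-bracket characters, or recurse into the whole pattern
closer <- "[\\)\\]]"               # a closing ) or ]
single <- "[A-Z]"                  # fallback: a bare single letter

# Glue the pieces back together; this reproduces the original pattern string.
pattern <- paste0(prefix, "(", opener, inner, closer, ")", "|", single)

# perl = TRUE is required, since (?R) and (?>...) are PCRE features.
result <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
result
```

The `(?R)` inside `inner` is what handles arbitrary nesting depth: whenever a bracket shows up inside a group, the whole pattern is applied again at that point.
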
PaulS