split string with regex

Question

I'm looking to split a string of a generic form, where the square brackets denote the "sections" of the string. Ex:

x <- "[a] + [bc] + 1"

And return a character vector that looks like:

"[a]"  " + "  "[bc]" " + 1"

EDIT: Ended up using this:

x <- "[a] + [bc] + 1"
x <- gsub("\\[",",[",x)
x <- gsub("\\]","],",x)
strsplit(x,",")
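
For reference, running the snippet above on the example string leaves an empty first element, because a comma is also inserted before the leading `[`. A minimal sketch of one way to drop it, assuming the input never contains a literal comma (the names `marked` and `parts` are only for illustration):

```r
x <- "[a] + [bc] + 1"

# Mark section boundaries with commas, as in the edit above.
marked <- gsub("\\]", "],", gsub("\\[", ",[", x))

# strsplit() returns a list with one element per input string.
parts <- strsplit(marked, ",")[[1]]

# The comma inserted before the leading "[" yields an empty first piece;
# nzchar() keeps only the non-empty pieces.
parts <- parts[nzchar(parts)]
parts
# [1] "[a]"  " + "  "[bc]" " + 1"
```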

Please post what you ended up using as an answer rather than an edit to the post. — ggorlen, Oct 28 '21 at 02:29

IRTFM · Accepted Answer · 2013-03-22T16:05:13.553

6

I've seen TylerRinker's code and suspect it may be clearer than this, but this may serve as a way to learn a different set of functions. (I liked his better before I noticed that it split on spaces.) I tried adapting this to work with strsplit, but that function always removes the separators. Maybe this could be adapted into a newstrsplit that splits at the separators but leaves them in? It would probably need to avoid splitting at the first or last position and to distinguish between opening and closing separators.

scan(text =                          # use scan to split once the commas are inserted
       gsub("\\]", "],",             # insert a comma after each "]"
            gsub(".\\[", ",[", x)),  # replace the character before each "[" with a comma (never matches at position 1)
     what = "", sep = ",")           # read character fields separated by ","
#Read 4 items
#[1] "[a]"  " +"   "[bc]" " + 1"
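
The `" +"` in the output above has lost its trailing space because the `.` in `".\\["` matches the character before each `[` and the replacement overwrites it. A sketch of a variant that captures that character and puts it back (the capture group and `quiet = TRUE` are additions for illustration, not part of the answer as posted):

```r
x <- "[a] + [bc] + 1"

# "(.)" captures the character before "[" and "\\1" re-inserts it,
# so the comma is added without consuming anything.
marked <- gsub("\\]", "],", gsub("(.)\\[", "\\1,[", x))

# quiet = TRUE suppresses the "Read 4 items" message.
parts <- scan(text = marked, what = "", sep = ",", quiet = TRUE)
parts
# [1] "[a]"  " + "  "[bc]" " + 1"
```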

edited Mar 22 '13 at 16:05

answered Mar 22 '13 at 15:43


I like this approach as it is not dependent on white space for splitting. Maintaining the white space in the output was not important for this task, so I modified this to work with `strsplit`: – Jeff Keller Mar 22 '13 at 16:05
Thanks for the positive comment, but I consider @juba's a better answer. I'm going to use it to construct a simple parsing function that accepts a pair of arguments to signal beginning and ending delimiters that will be preserved. – IRTFM Mar 22 '13 at 16:08

score 5 · Answer 2 · answered Mar 22 '13 at 15:34

5

This is one lazy approach:

FUN <- function(x) {
    all <- unlist(strsplit(x, "\\s+"))
    last <- paste(c(" ", tail(all, 2)), collapse="")
    c(head(all, -2), last)
}

x <- "[a] + [bc] + 1"    
FUN(x)

## > FUN(x)
## [1] "[a]"  "+"    "[bc]" " +1"
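
Because this splits on whitespace rather than on the brackets, the result depends on how the input is spaced. A quick check, repeating the `FUN` from the answer so the snippet runs on its own and using a hypothetical input `x2` with the spaces removed:

```r
# FUN as defined in the answer above.
FUN <- function(x) {
    all <- unlist(strsplit(x, "\\s+"))
    last <- paste(c(" ", tail(all, 2)), collapse="")
    c(head(all, -2), last)
}

x2 <- "[a]+[bc]+1"   # same sections, no spaces
res <- FUN(x2)
res
# [1] " [a]+[bc]+1"
length(res)
# [1] 1
```

With no whitespace to split on, the whole string comes back as a single element with a space prepended.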

answered Mar 22 '13 at 15:34

Tyler Rinker


You say 'lazy' because you are using the spaces rather than using brackets to separate? – IRTFM Mar 22 '13 at 15:49
Yes ( no real intense regexing) – Tyler Rinker Mar 22 '13 at 16:23

juba · Answer 3 · 2013-03-22T15:53:57.303

5

You can compute the split points manually and use `substring`:

split.pos <- gregexpr('\\[.*?]',x)[[1]]
split.length <- attr(split.pos, "match.length")
split.start <- sort(c(split.pos, split.pos+split.length))
split.end <- c(split.start[-1]-1, nchar(x))
substring(x,split.start,split.end)
#  [1] "[a]"  " + "  "[bc]" " + 1"
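
The same idea can be wrapped in a function. The sketch below is only an illustration, not juba's code: the name `split_keep` and its `pattern` argument are hypothetical, and it adds handling for text before the first match, a match at the very end of the string, and no match at all.

```r
split_keep <- function(x, pattern = "\\[.*?\\]") {
  stopifnot(is.character(x), length(x) == 1)
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1) return(x)                  # no match: nothing to split
  len <- attr(m, "match.length")
  # A piece starts at position 1, at each match, and right after each match.
  starts <- sort(unique(c(1, m, m + len)))
  starts <- starts[starts <= nchar(x)]       # drop a start past the end of x
  ends <- c(starts[-1] - 1, nchar(x))
  substring(x, starts, ends)
}

split_keep("[a] + [bc] + 1")
# [1] "[a]"  " + "  "[bc]" " + 1"
split_keep("1 + [a]")
# [1] "1 + " "[a]"
```

Passing the whole pattern rather than separate opening and closing delimiters avoids having to escape them inside the function; that is a design shortcut for this sketch.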

edited Mar 22 '13 at 15:53

answered Mar 22 '13 at 15:40


There we go. Great progress toward making a 'newsplit'. Not that I understand it fully, but I thought `gregexpr` would be useful. I was surprised you didn't need to use "\\]" in the pattern. – IRTFM Mar 22 '13 at 16:04
I think `]` doesn't need to be escaped because the `[` is escaped: no character class has been opened, so `]` is not interpreted as the end of one. Hmm, not sure I'm very clear :-) – juba Mar 22 '13 at 16:05
I had the same thought, but it suggests that "specialness" is more context dependent than I would have expected. – IRTFM Mar 22 '13 at 16:10
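
A quick way to check this on the example string: outside a character class a closing bracket is matched literally, so the escaped and unescaped forms should find the same matches. This is only a sanity check on one input, not a statement about every regex flavour.

```r
x <- "[a] + [bc] + 1"

# Closing bracket left unescaped, as in the answer.
unescaped <- regmatches(x, gregexpr("\\[.*?]", x))[[1]]

# Closing bracket escaped.
escaped <- regmatches(x, gregexpr("\\[.*?\\]", x))[[1]]

unescaped
# [1] "[a]"  "[bc]"
identical(unescaped, escaped)
# [1] TRUE
```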

score 5 · Answer 4 · answered Mar 22 '13 at 16:12

And here's a version that splits on the brackets AND keeps them in the result, using positive lookahead and lookbehind:

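splitme <- function(x) {
  x <- unlist(strsplit(x, "(?=\\[)", perl=TRUE))   # split before each "["
  x <- unlist(strsplit(x, "(?<=\\])", perl=TRUE))  # split after each "]"
  lone <- which(x=="[")   # the lookahead split leaves each "[" as its own element
  for (i in lone) {
    x[i+1] <- paste(x[i], x[i+1], sep="")          # glue it onto the element that follows
  }
  if (length(lone) > 0) x <- x[-lone]   # x[-integer(0)] would return an empty vector
  x
}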
splitme(x)
#[1] "[a]"  " + "  "[bc]" " + 1"
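
The loop is needed because of how `strsplit` treats zero-length matches: when the lookahead matches at the very start of the remaining text, a single character is split off as its own piece. Inspecting the intermediate results on the example string shows the lone `[` elements that the loop glues back (the names `step1` and `step2` are only for illustration):

```r
x <- "[a] + [bc] + 1"

# Splitting on the zero-width lookahead leaves each "[" on its own.
step1 <- unlist(strsplit(x, "(?=\\[)", perl = TRUE))
step1
# [1] "["       "a] + "   "["       "bc] + 1"

# Splitting after each "]" then separates each section from what follows.
step2 <- unlist(strsplit(step1, "(?<=\\])", perl = TRUE))
step2
# [1] "["    "a]"   " + "  "["    "bc]"  " + 1"
```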
