19

R's strsplit function matches a given regular expression, deletes the matches, and returns the remaining pieces of the string as a vector.

> strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "def"

But how can I split the string the same way with a regular expression while also retaining the matches? I need something like the following.

> FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def" 

Using strapply("abc123def", "[0-9]+|[a-z]+") from the gsubfn package works here, but what if the rest of the string, apart from the matches, cannot be captured by a regular expression?

jackson
  • You can capture all characters using the pattern "[0-9]+|[^0-9]+", or extend the pattern to capture everything else and discard it from the output using FUN=function(x) if(grepl("^[0-9a-z]+$",x)) x – Wojciech Sobala Jun 14 '12 at 03:37
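
The complement pattern suggested in the comment above can also be tried without gsubfn. The following is a minimal sketch that swaps in base R's gregexpr and regmatches for strapply; the variable names are illustrative only.

```r
# Base R sketch: match either a run of digits or a run of non-digits,
# so every character of the input lands in exactly one piece.
test <- "abc123def"
pieces <- regmatches(test, gregexpr("[0-9]+|[^0-9]+", test))
print(pieces[[1]])
```

This relies on the delimiter being a character class that is easy to negate. For more complex delimiters the negation may not be expressible this simply.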

3 Answers

25

Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split at the transitions between [0-9]+ and everything else. In your string, those transitions are not marked by any character. To mark them, you could pre-process with gsub and a back-reference, wrapping each match in a delimiter (here "~", which must not already occur in the string):

test <- "abc123def"
strsplit(gsub("([0-9]+)", "~\\1~", test), "~")

[[1]]
[1] "abc" "123" "def"
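
The same idea can be wrapped in a small function. This is only a sketch: split_keep and its marker argument are hypothetical names, not part of base R, and it assumes the marker character never occurs in the input and that the pattern works with the default regex engine.

```r
# Hypothetical helper (not part of base R): wrap each match in a marker,
# then split on that marker.
split_keep <- function(x, pattern, marker = "\001") {
  # Assumption: the marker must not already occur in the input.
  stopifnot(!any(grepl(marker, x, fixed = TRUE)))
  marked <- gsub(paste0("(", pattern, ")"),
                 paste0(marker, "\\1", marker), x)
  pieces <- strsplit(marked, marker, fixed = TRUE)
  # Drop empty strings that appear when a match starts the string
  # or when two matches are adjacent.
  lapply(pieces, function(p) p[nzchar(p)])
}

split_keep("abc123def", "[0-9]+")
split_keep("123abc", "[0-9]+")
```

Using a control character as the marker makes a collision with real text less likely than "~", but the check at the top still stops early if the input happens to contain it.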
Ari B. Friedman
9

You could use lookaround assertions.

> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=TRUE)
[[1]]
[1] "abc" "123" "def"
Avinash Raj
  • Why the downvote? It works perfectly for this input. – Avinash Raj Mar 23 '15 at 10:17
  • +1: Not only does it work, I find this solution much more elegant! Consider a case where you would like to split formulas whenever you come across a plus or minus operator. In between you have variable names that you would like to edit. You can split the formula up, retain the operators as separate strings, edit the variable names, and then recombine the whole set of strings. This solution handles that without losing the plus and minus operators! – ToJo May 07 '16 at 12:54
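
The formula use case described in the comment might look roughly like the sketch below. The input string, the variable names, and the choice of toupper as the "edit" are all assumptions made for illustration; the pattern is the same lookaround idea with operator and non-operator classes in place of digits and non-digits.

```r
# Sketch of the formula use case, with made-up input.
formula_text <- "alpha+beta-gamma"

# Split at every boundary between an operator and a non-operator.
boundary <- "(?<=[^+-])(?=[+-])|(?<=[+-])(?=[^+-])"
parts <- strsplit(formula_text, boundary, perl = TRUE)[[1]]

# Edit only the variable names, leaving the operators untouched.
is_operator <- parts %in% c("+", "-")
parts[!is_operator] <- toupper(parts[!is_operator])

# Recombine the pieces into a single string.
result <- paste(parts, collapse = "")
print(result)
```

Both alternatives in the pattern begin with a lookbehind, as in the answer's pattern, so neither can match at the very start of the string.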
2

You can use strapply from the gsubfn package.

library(gsubfn)
test <- "abc123def"
strapply(X=test,
         pattern="([^[:digit:]]*)(\\d+)(.+)",
         FUN=c,
         simplify=FALSE)

[[1]]
[1] "abc" "123" "def"
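
If installing gsubfn is not an option, a similar capture-group extraction can be sketched in base R. This swaps in regexec and regmatches for strapply and writes the digit class as [[:digit:]] throughout; it is an illustration, not the answer's original method.

```r
test <- "abc123def"
# regexec returns the full match followed by each capture group.
m <- regmatches(test, regexec("([^[:digit:]]*)([[:digit:]]+)(.+)", test))
groups <- m[[1]][-1]  # drop the full match, keep the three groups
print(groups)
```

Note that this pattern expects a single run of digits followed by at least one more character. For an input such as "a1b2c" the third group would be "b2c", so it does not generalize to strings with several digit runs.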
Wojciech Sobala