4

I would like to split strings on the first and last comma. Each string has at least two commas. Below is an example data set and the desired result.

A similar question here asked how to split on the first comma: Split on first comma in string

Here I asked how to split strings on the first two colons: Split string on first two colons

Thank you for any suggestions. I prefer a solution in base R. Sorry if this is a duplicate.

my.data <- read.table(text='

my.string        some.data
123,34,56,78,90     10
87,65,43,21         20
a4,b6,c8888         30
11,bbbb,ccccc       40
uu,vv,ww,xx         50
j,k,l,m,n,o,p       60', header = TRUE, stringsAsFactors=FALSE)

desired.result <- read.table(text='

 my.string1 my.string2 my.string3 some.data
        123   34,56,78         90        10
         87      65,43         21        20
         a4         b6      c8888        30
         11       bbbb      ccccc        40
         uu      vv,ww         xx        50
          j  k,l,m,n,o          p        60', header = TRUE, stringsAsFactors=FALSE)
Mark Miller

5 Answers

4

You can use the \K operator, which keeps text already matched out of the result, together with a positive lookahead assertion to do this. (Well, almost: an annoying comma remains at the start of the middle portion, which I have yet to get rid of in the strsplit call.) But I enjoyed this as an exercise in constructing a regex...

x <- '123,34,56,78,90'
strsplit( x , "^[^,]+\\K|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123"       ",34,56,78" "90"

Explanation:

  • ^[^,]+ : from the start of the string match one or more characters that are not a ,
  • \\K : but don't include those matched characters in the match
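  • So the first match is the zero-width position just before the first comma (the comma itself is not consumed, which is why it stays attached to the middle piece)...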
  • | : or you can match...
  • ,(?=[^,]+$) : a , so long as it is followed by [(?=...)] one or more characters that are not a , until the end of the string ($)...
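One way to work around the leftover comma is to strip it in a second step. The following is only a sketch in base R, assuming every string has at least two commas as stated in the question; the names `x`, `parts`, and `out` are illustrative.

```r
# Two sample strings taken from the question's data
x <- c("123,34,56,78,90", "a4,b6,c8888")

# Split with the regex from this answer; the middle piece keeps a leading comma
parts <- strsplit(x, "^[^,]+\\K|,(?=[^,]+$)", perl = TRUE)

# Remove the leading comma from the second (middle) piece of each result
parts <- lapply(parts, function(p) {
  p[2] <- sub("^,", "", p[2])
  p
})

# Bind the pieces into a three-column character matrix
out <- do.call(rbind, parts)
out
```

This keeps the regex unchanged and treats the stray comma as a clean-up step rather than trying to make strsplit consume it.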
Simon O'Hanlon
  • +1, but you're missing a `,` after the `\K`, as he doesn't want the splitting `,` in the match. I'd suggest changing the regex to `^[^,]+\K,|,(?=[^,]+$)` or `^\w+\K,|,(?=\w+$)` – zx81 May 31 '14 at 02:44
  • @zx81 try it. You won't get what you are expecting. Think of it this way - Q: if you add the comma you suggest, how many of the commas will then match the regex? A: All of them! Your other suggestion will also not do what you expect. – Simon O'Hanlon May 31 '14 at 08:20
  • Simon, I had tried it. The second solution I gave you is the one I had come up with independently, and I didn't want to post it in order not to compete with you because the solutions were so close. Here's the [demo](http://regex101.com/r/sC6tP9), you can see that it only matches the right commas. And here's a [demo of your one](http://regex101.com/r/bA2oA1) with the additional comma I am suggesting. – zx81 May 31 '14 at 10:51
  • @zx81 let me be more specific - try it in R. I can't figure out why it won't work as you expect, but it doesn't. Must be something to do with greedy matching. I do not doubt that you came up with it independently at all btw! – Simon O'Hanlon May 31 '14 at 11:02
  • @zx81 that is actually quite weird... using `m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE );regmatches( x , m )` shows the desired matches are returned: the two and only two commas per line (where `x` is a character vector of the strings in your demo). – Simon O'Hanlon May 31 '14 at 11:06
  • Simon, glad you're able to test it in R, because I can't do that directly at the moment. I had only tested it with RegexBuddy in R mode. Sounds like you found the magic moves to make it work. :) Btw for that regex, the R code suggested by RB was `strsplit(subject, "^\\w+\\K,|,(?=\\w+$)", perl=TRUE);` – zx81 May 31 '14 at 11:14
  • @zx81 I've actually had to ask [a question](http://stackoverflow.com/q/23969411/1478381) because I can't work out why `strsplit` does not give the desired result. – Simon O'Hanlon May 31 '14 at 11:17
  • @SimonO'Hanlon I want to give you the check mark, but I cannot with that one comma still present. I know you posted a follow-up question. Might the answer to your post help to get rid of the comma? All of the answers deserve the check mark, but yours looks very nice. – Mark Miller Jun 01 '14 at 00:24
3

Here is a relatively simple approach. In the first line we use sub to replace the first and last commas with semicolons, producing s (the greedy .* stretches from the first comma to the last one). Then we read s using sep=";" and finally cbind the rest of my.data to it:

s <- sub(",(.*),", ";\\1;", my.data[[1]])
DF <- read.table(text=s, sep =";", col.names=paste0("mystring",1:3), as.is=TRUE)
cbind(DF, my.data[-1])

giving:

  mystring1 mystring2 mystring3 some.data
1       123  34,56,78        90        10
2        87     65,43        21        20
3        a4        b6     c8888        30
4        11      bbbb     ccccc        40
5        uu     vv,ww        xx        50
6         j k,l,m,n,o         p        60
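To see why a single sub call is enough, it may help to look at the intermediate strings. This sketch uses a few strings from the question; `x` and `s` are illustrative names, and it assumes the data contain no semicolons already, since those would collide with the temporary separator.

```r
x <- c("123,34,56,78,90", "a4,b6,c8888", "j,k,l,m,n,o,p")

# Guard for the assumption above: no semicolons in the input
stopifnot(!any(grepl(";", x, fixed = TRUE)))

# The match starts at the first comma; the greedy (.*) then runs
# as far as it can while still leaving one final comma to match
s <- sub(",(.*),", ";\\1;", x)
s
# [1] "123;34,56,78;90" "a4;b6;c8888"     "j;k,l,m,n,o;p"
```

If semicolons can occur in the data, another character that never appears in the strings could serve as the temporary separator instead.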
G. Grothendieck
1

Here is code to split on the first and last comma. This code draws heavily from an answer by @bdemarest here: Split string on first two colons. The gsub pattern below, which is the meat of the answer, contains important differences. The code for creating the new data frame after the strings are split is the same as that of @bdemarest.

# Replace first and last commas with colons.

new.string <- gsub(pattern="(^[^,]+),(.+),([^,]+$)", 
              replacement="\\1:\\2:\\3", x=my.data$my.string)
new.string

# Split on colons
split.data <- strsplit(new.string, ":")

# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")

my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data

#   my.string1 my.string2 my.string3 some.data
# 1        123   34,56,78         90        10
# 2         87      65,43         21        20
# 3         a4         b6      c8888        30
# 4         11       bbbb      ccccc        40
# 5         uu      vv,ww         xx        50
# 6          j  k,l,m,n,o          p        60



# Here is code for splitting strings on the first comma

my.data <- read.table(text='

my.string        some.data
123,34,56,78,90     10
87,65,43,21         20
a4,b6,c8888         30
11,bbbb,ccccc       40
uu,vv,ww,xx         50
j,k,l,m,n,o,p       60', header = TRUE, stringsAsFactors=FALSE)


# Replace first comma with colon

new.string <- gsub(pattern="(^[^,]+),(.+$)", 
                   replacement="\\1:\\2", x=my.data$my.string)
new.string

# Split on colon
split.data <- strsplit(new.string, ":")

# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")

my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data

#   my.string1  my.string2 some.data
# 1        123 34,56,78,90        10
# 2         87    65,43,21        20
# 3         a4    b6,c8888        30
# 4         11  bbbb,ccccc        40
# 5         uu    vv,ww,xx        50
# 6          j k,l,m,n,o,p        60




# Here is code for splitting strings on the last comma

my.data <- read.table(text='

my.string        some.data
123,34,56,78,90     10
87,65,43,21         20
a4,b6,c8888         30
11,bbbb,ccccc       40
uu,vv,ww,xx         50
j,k,l,m,n,o,p       60', header = TRUE, stringsAsFactors=FALSE)


# Replace last comma with colon

new.string <- gsub(pattern="^(.+),([^,]+$)", 
                   replacement="\\1:\\2", x=my.data$my.string)
new.string

# Split on colon
split.data <- strsplit(new.string, ":")

# Create new data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")

my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data

#     my.string1 my.string2 some.data
# 1 123,34,56,78         90        10
# 2     87,65,43         21        20
# 3        a4,b6      c8888        30
# 4      11,bbbb      ccccc        40
# 5     uu,vv,ww         xx        50
# 6  j,k,l,m,n,o          p        60
Mark Miller
1

You can do a simple strsplit here on that column:

popshift<-sapply(strsplit(my.data$my.string,","), function(x) 
    c(x[1], paste(x[2:(length(x)-1)],collapse=","), x[length(x)]))

desired.result <- cbind(data.frame(my.string=t(popshift)), my.data[-1])

I just split up all the values and make a new vector with the first, middle, and last strings. Then I cbind that with the rest of the data. The result is

  my.string.1 my.string.2 my.string.3 some.data
1         123    34,56,78          90        10
2          87       65,43          21        20
3          a4          b6       c8888        30
4          11        bbbb       ccccc        40
5          uu       vv,ww          xx        50
6           j   k,l,m,n,o           p        60
MrFlick
1

Using str_match() from package stringr, and a little help from one of your links:

library(stringr)
data.frame(str_match(my.data$my.string, "(.+?),(.*),(.+?)$")[,-1], 
           some.data = my.data$some.data)
#    X1        X2    X3 some.data
# 1 123  34,56,78    90        10
# 2  87     65,43    21        20
# 3  a4        b6 c8888        30
# 4  11      bbbb ccccc        40
# 5  uu     vv,ww    xx        50
# 6   j k,l,m,n,o     p        60
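Since the question prefers base R, the same capture-group idea can be sketched with regexec() and regmatches(), which need no extra package. This is a sketch under the question's assumption that every string has at least two commas; a string that fails to match would return an empty result and be dropped by rbind. The names `x`, `m`, and `parts` are illustrative.

```r
x <- c("123,34,56,78,90", "a4,b6,c8888", "j,k,l,m,n,o,p")

# Each list element holds the full match followed by the three capture groups
m <- regmatches(x, regexec("^([^,]+),(.*),([^,]+)$", x))

# Stack into a matrix and drop the first column (the full match)
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- paste0("my.string", 1:3)
parts
```

The resulting matrix can then be combined with the remaining columns, for example with cbind(as.data.frame(parts), my.data[-1]).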
Rich Scriven