
I would like to split a column of strings on the first two colons, but not on any subsequent colons:

my.data <- read.table(text='

my.string    some.data
123:34:56:78   -100
87:65:43:21    -200
a4:b6:c8888    -300
11:bbbb:ccccc  -400
uu:vv:ww:xx    -500', header = TRUE)

desired.result <- read.table(text='

my.string1  my.string2  my.string3  some.data
123         34          56:78         -100
87          65          43:21         -200
a4          b6          c8888         -300
11          bbbb        ccccc         -400
uu          vv          ww:xx         -500', header = TRUE)

I have searched extensively and the following question is the closest to my current dilemma:

Split on first comma in string

Thank you for any suggestions. I prefer to use base R.

EDIT:

The number of characters before the first colon is not always two and the number of characters between the first two colons is not always two. So, I edited the example to reflect this.

Mark Miller

5 Answers


Using package stringr:

library(stringr)
str_match(my.data$my.string, "(.+?):(.+?):(.*)")

     [,1]            [,2]  [,3]   [,4]   
[1,] "123:34:56:78"  "123" "34"   "56:78"
[2,] "87:65:43:21"   "87"  "65"   "43:21"
[3,] "a4:b6:c8888"   "a4"  "b6"   "c8888"
[4,] "11:bbbb:ccccc" "11"  "bbbb" "ccccc"
[5,] "uu:vv:ww:xx"   "uu"  "vv"   "ww:xx"

UPDATE: with latest example (above) and Hadley's comment solution:

str_split_fixed(my.data$my.string, ":", 3)
     [,1]  [,2]   [,3]   
[1,] "123" "34"   "56:78"
[2,] "87"  "65"   "43:21"
[3,] "a4"  "b6"   "c8888"
[4,] "11"  "bbbb" "ccccc"
[5,] "uu"  "vv"   "ww:xx"
topchef

In base R:

> my.data <- read.table(text='
+ 
+ my.string    some.data
+ 123:34:56:78   -100
+ 87:65:43:21    -200
+ a4:b6:c8888    -300
+ 11:bbbb:ccccc  -400
+ uu:vv:ww:xx    -500', header = TRUE,stringsAsFactors=FALSE)
> m <- regexec ("^([^:]+):([^:]+):(.*)$",my.data$my.string)
> my.data$my.string1 <- unlist(lapply(regmatches(my.data$my.string,m),'[',c(2)))
> my.data$my.string2 <- unlist(lapply(regmatches(my.data$my.string,m),'[',c(3)))
> my.data$my.string3 <- unlist(lapply(regmatches(my.data$my.string,m),'[',c(4)))
> my.data
      my.string some.data my.string1 my.string2 my.string3
1  123:34:56:78      -100        123         34      56:78
2   87:65:43:21      -200         87         65      43:21
3   a4:b6:c8888      -300         a4         b6      c8888
4 11:bbbb:ccccc      -400         11       bbbb      ccccc
5   uu:vv:ww:xx      -500         uu         vv      ww:xx

You'll see I've used stringsAsFactors=FALSE to ensure that my.string can be processed as a vector of strings.
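The three `regmatches()` calls above can probably be collapsed into one pass. What follows is only a sketch, not part of the original answer: it reuses the same pattern, stacks the matches into a matrix, and drops the whole-match column. The names `x`, `parts` and `mat` are made up for illustration, and the sketch assumes every string matches the pattern (a non-matching string yields `character(0)`, which `rbind()` would silently skip, misaligning the rows).

```r
# Same strings as in the question.
x <- c("123:34:56:78", "87:65:43:21", "a4:b6:c8888",
       "11:bbbb:ccccc", "uu:vv:ww:xx")

# One regexec()/regmatches() pass. Each list element is
# c(whole match, group 1, group 2, group 3).
parts <- regmatches(x, regexec("^([^:]+):([^:]+):(.*)$", x))

# Stack the list into a matrix and drop column 1 (the whole match).
mat <- do.call(rbind, parts)[, -1, drop = FALSE]
colnames(mat) <- paste0("my.string", 1:3)
mat
```

The matrix can then be attached to the original data with `cbind(my.data, mat)`.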

Simon
  • This is a great answer, but I am wondering what the numbers in `m` mean? – Mark Miller Nov 03 '13 at 04:39
  • `regexec()` returns a list with one element per input string. Each element is a vector of match start positions (the whole match comes first, so the first explicit group is #2, the second is #3, etc.), and its "match.length" attribute holds the corresponding match lengths. `regmatches()` then uses that match data to extract the matched text from the vector of strings. – Simon Nov 03 '13 at 04:46
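To make the comment above concrete, here is a small sketch that inspects `m` for one matching and one non-matching string. The input vector `x` is invented for illustration; the pattern is the one from the answer.

```r
x <- c("123:34:56:78", "no-colons-here")
m <- regexec("^([^:]+):([^:]+):(.*)$", x)

# Start positions: whole match first, then the three groups.
starts <- as.integer(m[[1]])

# Matching lengths live in the "match.length" attribute.
lens <- as.integer(attr(m[[1]], "match.length"))

# A string that does not match gets -1 as its start position ...
no_match <- as.integer(m[[2]])

# ... and regmatches() returns character(0) for it.
extracted <- regmatches(x, m)
```

For the first string, `substring(x[1], starts, starts + lens - 1)` should reproduce what `regmatches()` extracts.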

Replace first two ":" with ",", and then split on ",".

x <- sub("([[:alnum:]]*):([[:alnum:]]*):(.)","\\1,\\2,\\3","12:34:56:78")

strsplit(x,",")

Applying to data frame

# sub() replaces only the first match per string, so later colons are left alone.
a.list <- strsplit(sub("([[:alnum:]]*):([[:alnum:]]*):(.)","\\1,\\2,\\3",my.data$my.string),",")
a.vect <- unlist(a.list)
a.df <- as.data.frame(matrix(a.vect,ncol=3,byrow=TRUE), stringsAsFactors = FALSE)
names(a.df) <- c("my.string1",  "my.string2",  "my.string3") 
a.df$some.data <- my.data$some.data
a.df 
ndr
  • This is neat, but it requires a placeholder character (here a comma) that never appears anywhere else in the string. – topchef Nov 03 '13 at 04:37
  • @topchef True, something like "ZZZZZZZZZZ8888888888" would likely do it :) – ndr Nov 03 '13 at 04:53
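One way to act on the concern raised in these comments is to pick a placeholder and verify that it is absent before using it. The sketch below is an illustration only: the control character `"\001"` is an arbitrary, hypothetical choice of placeholder, and the input `x` is invented so that one string contains a comma.

```r
x <- c("123:34:56:78", "a,b:c:d:e")

# Hypothetical placeholder: a control character unlikely to occur in text data.
sep <- "\001"

# Fail loudly if the placeholder already occurs in the data.
stopifnot(!any(grepl(sep, x, fixed = TRUE)))

# Replace only the first two colons, then split on the placeholder.
y <- sub("^([^:]*):([^:]*):", paste0("\\1", sep, "\\2", sep), x)
res <- strsplit(y, sep, fixed = TRUE)
res
```

Because the split is on the placeholder rather than on a comma, the comma in the second string stays inside its first piece.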

I'm a bit late to the game, and my solution overlaps a lot with the earlier answers. Nevertheless, it might be useful to someone:

# Replace first two colons with commas.
new.string = gsub(pattern="(^[^:]+):([^:]+):(.+$)",
                  replacement="\\1,\\2,\\3",
                  x=my.data$my.string)

# Split on commas, producing a list.
split.data = strsplit(new.string, ",")

# Change list into matrix, then data.frame.
new.data = data.frame(do.call(rbind, split.data))
names(new.data) = paste("my.string", seq(ncol(new.data)), sep="")

my.data$my.string = NULL
my.data = cbind(new.data, my.data)
my.data

#   my.string1 my.string2 my.string3 some.data
# 1        123         34      56:78      -100
# 2         87         65      43:21      -200
# 3         a4         b6      c8888      -300
# 4         11       bbbb      ccccc      -400
# 5         uu         vv      ww:xx      -500

As noted by @topchef, commas (or whatever other character is used as the separator) must be guaranteed to be absent from the data.

Also, at least two colons must be present in each string, or else the pattern doesn't match anything and thus no splitting occurs.
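The two-colon caveat can be checked up front. This is a rough sketch under the same assumption as the answer (no commas in the data): it flags strings that match the pattern and fills the rest with `NA` instead of leaving them unsplit. The names `x`, `pattern`, `ok` and `out` are invented for illustration.

```r
x <- c("123:34:56:78", "only:one", "none")
pattern <- "^([^:]+):([^:]+):(.+)$"

# TRUE where the string has at least two colons with text around them.
ok <- grepl(pattern, x)

# Pre-fill with NA so that non-matching rows stay visible in the result.
out <- matrix(NA_character_, nrow = length(x), ncol = 3)
if (any(ok)) {
  pieces <- strsplit(sub(pattern, "\\1,\\2,\\3", x[ok]), ",")
  out[ok, ] <- do.call(rbind, pieces)
}
out
```

Rows where `ok` is `FALSE` can then be inspected or handled separately.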

bdemarest

Couldn't you just use strsplit(sub(":\\s*", XX, x), XX) (like the example in the question you linked) to split on the first colon, then take the second half and split it on the first colon again?
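A possible base R sketch of this two-step idea follows. It is a variant rather than the exact call above: instead of substituting a placeholder, it locates the first colon with `regexpr()` and cuts with `substr()`. The helper `split_first()` is hypothetical and written only for this illustration.

```r
# Hypothetical helper: split each string at its first colon only.
split_first <- function(x) {
  pos <- as.integer(regexpr(":", x, fixed = TRUE))  # -1 when there is no colon
  has <- pos > 0
  list(
    first = ifelse(has, substr(x, 1, pos - 1), x),
    rest  = ifelse(has, substr(x, pos + 1, nchar(x)), NA_character_)
  )
}

x <- c("123:34:56:78", "87:65:43:21", "a4:b6:c8888",
       "11:bbbb:ccccc", "uu:vv:ww:xx")

step1 <- split_first(x)           # split on the first colon
step2 <- split_first(step1$rest)  # split the remainder on its first colon

result <- data.frame(my.string1 = step1$first,
                     my.string2 = step2$first,
                     my.string3 = step2$rest,
                     stringsAsFactors = FALSE)
result
```

Any colons after the second one stay in `my.string3`, since each step only looks for the first colon in what it is given.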

The Chad