R cSplit only using first delimiter in string

Question

I had a long list with two columns, where the same pair of strings appeared in multiple rows. So I used paste to concatenate the two columns using - and then used setDT to return the unique set of concatenations with their frequencies.

Now I want to reverse my concatenation.

I tried:

library(splitstackshape)
d5 <- cSplit(d4, 'conc', '-', 'wide')

However in my second column I sometimes had multiple -'s within the string.

To get around this I'd like cSplit to ONLY use the first - delimiter.

Example:

 conc      freq
 A-hello      4
 A-Hi-there   5
 B-HELLO      1

Using the above cSplit would return:

freq conc_001  conc_002  conc_003
   4        A     hello        NA
   5        A        Hi     there
   1        B     HELLO        NA

I would like:

freq conc_001  conc_002
   4        A     hello
   5        A  Hi-there
   1        B     HELLO

You might want to use `separate` from the "tidyr" package. I didn't design `cSplit` to conveniently handle these types of cases. With "tidyr", the approach might be something like `separate(mydf, conc, into = c("conc_001", "conc_002"), extra = "merge")`. — A5C1D2H2I1M1N2O1R2T1, May 18 '16 at 16:07
I suppose you could also do something silly like: `cSplit(setDT(mydf)[, conc := sub("-", "%^%&", conc)], "conc", "%^%&")` :-) — A5C1D2H2I1M1N2O1R2T1, May 18 '16 at 16:20

score 3 · Accepted Answer · answered May 18 '16 at 16:34

Here is another idea. By using sub, we change only the first occurrence of the delimiter in each string. We then use cSplit with the new delimiter.

library(splitstackshape)
df$conc <- sub('-', ' ', df$conc)
cSplit(df, 'conc', ' ', 'wide')
#   freq conc_1   conc_2
#1:    4      A    hello
#2:    5      A Hi-there
#3:    1      B    HELLO
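
One caveat with swapping in a space: if the remainder of a string already contains a space, cSplit will split on that too. As a rough sketch (not part of the original answer), the same first-delimiter split can be done in base R without a temporary delimiter. The variable names `pos`, `first` and `rest` are made up for illustration, and the code assumes every value contains at least one "-".

```r
conc <- c("A-hello", "A-Hi-there", "B-HELLO")

# sub() replaces only the first match; gsub() replaces every match.
sub("-", " ", conc, fixed = TRUE)   # "A hello" "A Hi-there" "B HELLO"
gsub("-", " ", conc, fixed = TRUE)  # "A hello" "A Hi there" "B HELLO"

# Locate the first "-" in each string (assumption: one is always present).
pos <- regexpr("-", conc, fixed = TRUE)

# Everything before the first "-" and everything after it.
first <- substr(conc, 1, pos - 1)
rest  <- substr(conc, pos + 1, nchar(conc))

data.frame(conc_1 = first, conc_2 = rest, stringsAsFactors = FALSE)
```

Because nothing is substituted into the strings, this version does not depend on picking a delimiter that is absent from the data.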

Sotos

Thanks again Sotos! So sub only acts on the first and gsub acts on all. – Oli May 19 '16 at 08:08

Dave2e · Answer 2 · 2016-05-18T16:15:44.917

2

Try this; it may not be as straightforward as using the cSplit function. Performance is fairly fast with this method.

#Sample Data    
s<-c("A-hello", "A-Hi-there", "B-HELLO")
df<-data.frame(s)

#split the data into 2 parts and assign to new columns in the dataframe.
library(stringr)
mat  <- matrix(unlist(str_split(df$s, "-", n=2)), ncol=2, byrow=TRUE)
dfnew<-as.data.frame(mat, stringsAsFactors = FALSE)

Once the matrix "mat" is created, one can cbind the result onto the original data frame.
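
The cbind step might look something like the sketch below. It assumes the question's column names (`conc`, `freq`) and uses `str_split_fixed` from stringr, which returns a character matrix directly; the names `parts` and `result` and the new column names are invented for illustration.

```r
library(stringr)

# Sample data shaped like the question's example (assumed column names).
df <- data.frame(conc = c("A-hello", "A-Hi-there", "B-HELLO"),
                 freq = c(4, 5, 1),
                 stringsAsFactors = FALSE)

# Split each string into at most 2 pieces at the first "-".
parts <- str_split_fixed(df$conc, "-", n = 2)
colnames(parts) <- c("conc_1", "conc_2")

# Bind the new columns onto the remaining original column.
result <- cbind(df["freq"], as.data.frame(parts, stringsAsFactors = FALSE))
result
```

Strings without any "-" would get an empty string in the second column, so that case may need separate handling.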

edited May 18 '16 at 16:15

answered May 18 '16 at 16:03

Dave2e

This also works, and I'm guessing that, unlike sub and gsub, having n=x allows for more complicated cases. Thanks! – Oli May 19 '16 at 08:13
I guess you could use `str_split_fixed` and avoid `matrix(unlist...` part – Sotos May 19 '16 at 08:29
As per Sotos comment, this works: data.frame(str_split_fixed(df$s, "-", n=2)). The result is the desired 2 column by n dataframe. – Dave2e May 19 '16 at 12:16
x <- str_split(df$conc, "-", n=2) then y <- data.frame(x) then z <- t(y) then df1 <- data.frame(z) then df2 <- data.frame(sec=df1$X1, word=df1$X2) then df3 <- data.frame(sec=df2$sec, word=df2$word) worked too. – Oli May 19 '16 at 12:36
