I had a long list with two columns, where the same string appeared in each column across multiple rows. So I used paste to concatenate the two columns with - as the separator, and then used setDT to return the unique set of concatenated values with their frequencies.
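For context, the concatenate-and-count step described above might look roughly like the sketch below. The input data, the column names `col1` and `col2`, and the object names `d3` and `d4` are assumptions made for illustration; only `paste`, `setDT`, and the resulting `conc`/`freq` columns come from the question.

```r
library(data.table)

# Hypothetical two-column input; col1 and col2 are assumed names.
d3 <- data.frame(
  col1 = c("A", "A", "A", "B"),
  col2 = c("hello", "Hi-there", "Hi-there", "HELLO"),
  stringsAsFactors = FALSE
)

# Concatenate the two columns with "-" as the separator.
d3$conc <- paste(d3$col1, d3$col2, sep = "-")

# Convert to a data.table by reference, then count each distinct value.
d4 <- setDT(d3)[, .(freq = .N), by = conc]
```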

Now I want to reverse my concatenation.

I tried:

library(splitstackshape)
d5 <- cSplit(d4, 'conc', '-', 'wide')

However, the strings in my second column sometimes contained multiple - characters.

To get around this I'd like cSplit to ONLY use the first - delimiter.

Example:

 conc      freq
 A-hello      4
 A-Hi-there   5
 B-HELLO      1

Using the above cSplit would return:

freq conc_001  conc_002  conc_003
   4        A     hello        NA
   5        A        Hi     there
   1        B     HELLO        NA

I would like:

freq conc_001  conc_002
   4        A     hello
   5        A  Hi-there
   1        B     HELLO
A5C1D2H2I1M1N2O1R2T1
Oli
  • You might want to use `separate` from the "tidyr" package. I didn't design `cSplit` to conveniently handle these types of cases. With "tidyr", the approach might be something like `separate(mydf, conc, into = c("conc_001", "conc_002"), extra = "merge")`. – A5C1D2H2I1M1N2O1R2T1 May 18 '16 at 16:07
  • I suppose you could also do something silly like: `cSplit(setDT(mydf)[, conc := sub("-", "%^%&", conc)], "conc", "%^%&")` :-) – A5C1D2H2I1M1N2O1R2T1 May 18 '16 at 16:20
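The `separate` suggestion from the first comment can be tried on the question's example data as sketched below. The explicit `sep = "-"` is an addition here (the comment relies on the default separator), and the `freq` values are copied from the question.

```r
library(tidyr)

# Example data from the question.
mydf <- data.frame(
  conc = c("A-hello", "A-Hi-there", "B-HELLO"),
  freq = c(4, 5, 1),
  stringsAsFactors = FALSE
)

# extra = "merge" splits at most length(into) - 1 times, so anything
# after the first "-" stays together in the last column.
out <- separate(mydf, conc, into = c("conc_001", "conc_002"),
                sep = "-", extra = "merge")
```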

2 Answers

Here is another idea. Because sub replaces only the first match, we can use it to change just the first - in each string to a different delimiter. We then use cSplit with the new delimiter.

library(splitstackshape)
df$conc <- sub('-', ' ', df$conc)
cSplit(df, 'conc', ' ', 'wide')
#   freq conc_1   conc_2
#1:    4      A    hello
#2:    5      A Hi-there
#3:    1      B    HELLO
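One caveat with the approach above: if a value already contains a space (for example "A-Hi there"), splitting on the space would produce extra columns. A base R variation that avoids a placeholder delimiter is sketched below. It is not part of the original answer, and it assumes every value contains at least one "-".

```r
# Example data from the question.
df <- data.frame(
  conc = c("A-hello", "A-Hi-there", "B-HELLO"),
  freq = c(4, 5, 1),
  stringsAsFactors = FALSE
)

# Position of the first "-" in each string.
pos <- regexpr("-", df$conc, fixed = TRUE)

# Text before the first "-" and text after it.
df$conc_1 <- substr(df$conc, 1, pos - 1)
df$conc_2 <- substr(df$conc, pos + 1, nchar(df$conc))
```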
Sotos

Try this. It may not be as straightforward as using the cSplit function, but it performs fairly fast.

# Sample data
s <- c("A-hello", "A-Hi-there", "B-HELLO")
df <- data.frame(s)

# Split each string into at most 2 parts and collect the parts in a 2-column matrix.
library(stringr)
mat <- matrix(unlist(str_split(df$s, "-", n = 2)), ncol = 2, byrow = TRUE)
dfnew <- as.data.frame(mat, stringsAsFactors = FALSE)

Once the matrix "mat" is created, you can cbind the result onto your original data frame.
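A minimal sketch of that cbind step is shown below. The `freq` column and the column names `conc_1` and `conc_2` are assumptions added for illustration, and the `matrix(unlist(...))` construction relies on every string containing at least one "-".

```r
library(stringr)

# Sample data, with an assumed freq column alongside the strings.
df <- data.frame(
  s = c("A-hello", "A-Hi-there", "B-HELLO"),
  freq = c(4, 5, 1),
  stringsAsFactors = FALSE
)

# Split on the first "-" only and arrange the pieces in two columns.
mat <- matrix(unlist(str_split(df$s, "-", n = 2)), ncol = 2, byrow = TRUE)
colnames(mat) <- c("conc_1", "conc_2")

# Bind the new columns onto the original data frame.
dfnew <- cbind(df, mat, stringsAsFactors = FALSE)
```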

Dave2e
  • This also works, and I'm guessing that, unlike sub and gsub, having n=x allows for more complicated cases. Thanks! – Oli May 19 '16 at 08:13
  • I guess you could use `str_split_fixed` and avoid the `matrix(unlist...` part – Sotos May 19 '16 at 08:29
  • As per Sotos' comment, this works: `data.frame(str_split_fixed(df$s, "-", n=2))`. The result is the desired data frame with n rows and 2 columns. – Dave2e May 19 '16 at 12:16
  • `x <- str_split(df$conc, "-", n=2)`, then `y <- data.frame(x)`, then `z <- t(y)`, then `df1 <- data.frame(z)`, then `df2 <- data.frame(sec=df1$X1, word=df1$X2)`, then `df3 <- data.frame(sec=df2$sec, word=df2$word)` worked too. – Oli May 19 '16 at 12:36