-1

I have a question similar to the one found here

If I have a data structure like the following:

 ROW_NO.        INPUT          STRAND
       1  1,888639,T,C             -1
       2  1,889158,G,C             NA                               
       3  1,889159,A,C             NA                                     
       4 1,978978,GC,G              1                                      
       5  1,982941,T,C             NA                                      
       6 1,1888193,C,A             -1

What do I need to do in order to split the INPUT column like so:

 ROW_NO.        INPUT    Chrom     Position    Ref.y   Variant.y   
       1  1,888639,T,C    chr1       888639        T           C
       2  1,889158,G,C    chr1       889158        G           C                            
       3  1,889159,A,C    chr1       889159        A           C                                         
       4 1,978978,GC,G    chr1       978978       GC           G        
       5  1,982941,T,C    chr1       982941        T           C        
       6 1,1888193,C,A    chr1      1888193        C           A

And I'd keep the STRAND column where it was, though somehow while formatting everything got messed up here so I left it out.

Community
  • 1
  • 1
soosus
  • 1,211
  • 4
  • 18
  • 27

3 Answers3

2

Calling your dataset df:

library(stringr)
result <- data.frame(df,do.call(rbind,str_split(df$INPUT,",")))
result
#   ROW_NO.         INPUT STRAND X1      X2 X3 X4
# 1       1  1,888639,T,C     -1  1  888639  T  C
# 2       2  1,889158,G,C     NA  1  889158  G  C
# 3       3  1,889159,A,C     NA  1  889159  A  C
# 4       4 1,978978,GC,G      1  1  978978 GC  G
# 5       5  1,982941,T,C     NA  1  982941  T  C
# 6       6 1,1888193,C,A     -1  1 1888193  C  A

The rest is cosmetics:

colnames(result)[4:7] <- c("Chrom","Position","Ref.y","Variant.y")
result$Chrom <- paste0("chr",result$Chrom)

EDIT A note on the comments.

The suggested alternative:

result <- data.frame(df,do.call(rbind,strsplit(df$INPUT,",")))

fails if df$INPUT is a factor, which it might or might not be, depending on how you read in the data. The solution above, using str_split(...), does not suffer from this deficiency.

jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • 1
    I'm not sure what `stringr` is adding here, since `result <- data.frame(d,do.call(rbind,strsplit(d$INPUT,",")))` using just base will give the same result. – Thomas May 05 '14 at 21:42
  • 1
    @Thomas, it's adding some waiting time :-) – A5C1D2H2I1M1N2O1R2T1 May 06 '14 at 02:53
  • Regarding your edit, even if `df$INPUT` is a factor, using `strsplit(as.character(df$INPUT), ...)` would still be faster. Unfortunately, some of the functions in "stringr" provide convenience at a very big time cost, and `str_split` is one of them. – A5C1D2H2I1M1N2O1R2T1 May 06 '14 at 05:11
  • 1
    My recommendation would be: `data.frame(mydf, do.call(rbind, strsplit(as.character(mydf$INPUT), ",", fixed = TRUE)))`. The `fixed = TRUE` will also give a good speed boost. – A5C1D2H2I1M1N2O1R2T1 May 06 '14 at 05:13
2

If d is your dataframe, you can do this in base by applying the read.csv function to the INPUT column:

> out <- cbind(d, read.csv(text=d$INPUT, header=FALSE,
                           col.names=c('Chrom','Position','Ref.y','Variant.y')))
> out
  ROW_NO.         INPUT STRAND Chrom Position Ref.y Variant.y
1       1  1,888639,T,C     -1     1   888639     T         C
2       2  1,889158,G,C     NA     1   889158     G         C
3       3  1,889159,A,C     NA     1   889159     A         C
4       4 1,978978,GC,G      1     1   978978    GC         G
5       5  1,982941,T,C     NA     1   982941     T         C
6       6 1,1888193,C,A     -1     1  1888193     C         A

And, as in @jlhoward's answer, you can just use paste to get Chrom looking the way you want:

out$Chrom <- paste0('chr',out$Chrom)
Thomas
  • 43,637
  • 12
  • 109
  • 140
2

I would recommend concat.split from my "splitstackshape" package:

library(splitstackshape)
concat.split(mydf, "INPUT", ",")
#   ROW_NO.         INPUT STRAND INPUT_1 INPUT_2 INPUT_3 INPUT_4
# 1       1  1,888639,T,C     -1       1  888639       T       C
# 2       2  1,889158,G,C     NA       1  889158       G       C
# 3       3  1,889159,A,C     NA       1  889159       A       C
# 4       4 1,978978,GC,G      1       1  978978      GC       G
# 5       5  1,982941,T,C     NA       1  982941       T       C
# 6       6 1,1888193,C,A     -1       1 1888193       C       A

There is a faster version I've been working on that is currently only at this Gist. Eventually, it will replace the existing concat.split function.

cSplit(mydf, "INPUT", ",")
#    ROW_NO. STRAND INPUT_1 INPUT_2 INPUT_3 INPUT_4
# 1:       1     -1       1  888639       T       C
# 2:       2     NA       1  889158       G       C
# 3:       3     NA       1  889159       A       C
# 4:       4      1       1  978978      GC       G
# 5:       5     NA       1  982941       T       C
# 6:       6     -1       1 1888193       C       A
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485