Split a column of values delimited by colons into separate columns for each value

Question

I have a table of strings and numbers as below:

           V1                  V2
1  GT:AD:DP:GQ:PL  0/1:10,45:55:70:106,0,70
2  GT:AD:DP:GQ:PL  1/1:2,42:44:16:288,16,0
3  GT:AD:DP:GQ:PL  1/1:3,37:40:14:147,14,0
4  GT:AD:DP:GQ:PL  0/1:7,50:57:55:250,0,55

For column V2, I would like to split the colon-delimited (':') values into separate columns, one per value, e.g.:

   V1              V2   V3     V4  V5  V6
1  GT:AD:DP:GQ:PL  0/1  10,45  55  70  106,0,70

score 3 · Answer 1 · edited Feb 04 '14 at 01:39

Using read.table twice with 2 different separators:

txt = '           V1                  V2
1  GT:AD:DP:GQ:PL  0/1:10,45:55:70:106,0,70
2  GT:AD:DP:GQ:PL  1/1:2,42:44:16:288,16,0
3  GT:AD:DP:GQ:PL  1/1:3,37:40:14:147,14,0
4  GT:AD:DP:GQ:PL  0/1:7,50:57:55:250,0,55'

## to read from a file, replace text=txt with the path to your file
dat <- read.table(text=txt,header=TRUE,stringsAsFactors=FALSE)
data.frame(x1=dat$V1,read.table(text=dat$V2,sep=':'))

              x1  V1    V2 V3 V4       V5
1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
2 GT:AD:DP:GQ:PL 1/1  2,42 44 16 288,16,0
3 GT:AD:DP:GQ:PL 1/1  3,37 40 14 147,14,0
4 GT:AD:DP:GQ:PL 0/1  7,50 57 55 250,0,55
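As a possible refinement, the keys in the first column could be reused as column names for the split values. This is only a sketch: it assumes every row carries the same key string (true for the sample data, not guaranteed in general), and the names keys, fields and res are made up for illustration.

```r
# Toy data copied from the first two rows of the question
dat <- data.frame(
  V1 = rep("GT:AD:DP:GQ:PL", 2),
  V2 = c("0/1:10,45:55:70:106,0,70", "1/1:2,42:44:16:288,16,0"),
  stringsAsFactors = FALSE
)

# Assumption: all rows share the same key string, so the first row names them all
keys <- strsplit(dat$V1[1], ":", fixed = TRUE)[[1]]

# Split V2 on ':' and label the resulting columns with the keys
fields <- read.table(text = dat$V2, sep = ":", col.names = keys,
                     stringsAsFactors = FALSE)

res <- data.frame(FORMAT = dat$V1, fields, stringsAsFactors = FALSE)
print(res)
```

Because read.table guesses column types, the single-number fields (DP and GQ here) come back as integers while the comma-containing fields stay as character.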

edited Feb 04 '14 at 01:39 by Max · answered Jan 16 '14 at 08:24 by agstudy

@Jeremy I don't get your point. It looks like you are trying to address a general case, but it is not clear to me (at least not the vcf format the way you describe it). You assume something that is not in the OP (I can't guess that the OP has other columns). Maybe you can add an example to your answer... – agstudy Jan 16 '14 at 09:01
Yeah, I know more about the data than the OP gave. It is two columns from a variant call file (vcf) obtained from aligning genetic sequence data to a reference and calling variants. Your solution works fine for the sample data given. – JeremyS Jan 16 '14 at 09:06
@Jeremy What I mean is that it would be helpful and more interesting if you provided the general file format so that I can adapt my solution (even though I don't see how it fails). – agstudy Jan 16 '14 at 09:19
Oh right, this site has the first few lines of one: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 – JeremyS Jan 16 '14 at 09:34

score 3 · Answer 2 · answered Jan 16 '14 at 08:25

Another way to do it, with the table stored in a data frame DF:

data.frame(DF$V1, do.call(rbind, strsplit(DF$V2, split = ":", fixed = TRUE)))
##            DF.V1  X1    X2 X3 X4       X5
## 1 GT:AD:DP:GQ:PL 0/1 10,45 55 70 106,0,70
## 2 GT:AD:DP:GQ:PL 1/1  2,42 44 16 288,16,0
## 3 GT:AD:DP:GQ:PL 1/1  3,37 40 14 147,14,0
## 4 GT:AD:DP:GQ:PL 0/1  7,50 57 55 250,0,55
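One thing to keep in mind with the strsplit approach is that every piece comes back as text. A minimal sketch of converting the single-number columns afterwards is below; it assumes the third and fourth pieces (X3 and X4 in the output above) always hold one integer each, and the names parts and out are hypothetical.

```r
# Toy data copied from the first two rows of the question
DF <- data.frame(
  V1 = rep("GT:AD:DP:GQ:PL", 2),
  V2 = c("0/1:10,45:55:70:106,0,70", "1/1:2,42:44:16:288,16,0"),
  stringsAsFactors = FALSE
)

# strsplit returns one character vector per row; rbind stacks them into a matrix
parts <- do.call(rbind, strsplit(DF$V2, split = ":", fixed = TRUE))
out <- data.frame(DF$V1, parts, stringsAsFactors = FALSE)

# Assumption: pieces 3 and 4 contain a single integer each
out$X3 <- as.integer(out$X3)
out$X4 <- as.integer(out$X4)

str(out)
```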

score 3 · Answer 3 · edited May 23 '17 at 11:54

I've included a family of functions called concat.split in my "splitstackshape" package, one of which is concat.split.multiple. Under the hood, it is like @agstudy's answer, but allows you to split multiple columns at once.

Usage is simple:

library(splitstackshape)
### Three required arguments: The input dataset,
###   a vector of the columns that need to be split up
###   (can also be the numeric column position), and the 
###   separator that should be used (can be different 
###   for each column).
concat.split.multiple(data = dat, split.cols = c("V2"), seps = ":")
#               V1 V2_1  V2_2 V2_3 V2_4     V2_5
# 1 GT:AD:DP:GQ:PL  0/1 10,45   55   70 106,0,70
# 2 GT:AD:DP:GQ:PL  1/1  2,42   44   16 288,16,0
# 3 GT:AD:DP:GQ:PL  1/1  3,37   40   14 147,14,0
# 4 GT:AD:DP:GQ:PL  0/1  7,50   57   55 250,0,55

See also this answer and this Gist for an idea for where the development of the function might be headed. The "data.table" variant will be much faster on larger datasets, but the data must be "rectangular" (that is, the resulting number of columns after the split must be balanced).
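To illustrate what "rectangular" means here, the following base R sketch pads unbalanced splits with NA so they still fit into one matrix. It is not how splitstackshape does it internally; the input vector x is invented for illustration, with two entries deliberately shortened.

```r
# Hypothetical input: the second and third entries have fewer pieces than the first
x <- c("0/1:10,45:55:70:106,0,70", "./.", "1/1:2,42:44")

pieces <- strsplit(x, ":", fixed = TRUE)
width <- max(lengths(pieces))  # widest row decides the number of columns

# Indexing past the end of a vector yields NA, which pads the short rows
padded <- t(vapply(pieces, function(p) p[seq_len(width)], character(width)))

print(padded)
```

Without such padding, do.call(rbind, ...) on pieces of unequal length would recycle values and only emit a warning, which is easy to miss.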

JeremyS · Accepted Answer · 2014-01-16T08:53:00.630

Call that table vcf:

vcf.info <- data.frame(t(sapply(vcf[,2], function(y) strsplit(y,split=":")[[1]])))

Then cbind that with the original vcf column(s) that you want:

vcf.info2 <- cbind(vcf[,1],vcf.info)

But in a real vcf I would use:

vcf.info2 <- cbind(vcf[,c(1,2,4,5,6,8,9)],vcf.info)

Something else you may find useful: here I am extracting just the read depth. Replace n with the index of the last sample column (9 plus the number of samples you have), and replace the 3 with a number from 1 to 5 to pick GT, AD, DP, GQ or PL:

selectReadDepth <- apply(vcf[,10:n],2,function(x) sapply(x, function(y) strsplit(y,split=":")[[1]][3]))
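A self-contained sketch of the same idea on a toy table is shown below. The column names sampleA and sampleB and the helper get_field are hypothetical; a real vcf would have its sample columns starting at position 10, as in the line above.

```r
# Toy table: one FORMAT column and two made-up sample columns,
# filled with the genotype strings from the question
vcf <- data.frame(
  FORMAT  = rep("GT:AD:DP:GQ:PL", 2),
  sampleA = c("0/1:10,45:55:70:106,0,70", "1/1:2,42:44:16:288,16,0"),
  sampleB = c("1/1:3,37:40:14:147,14,0", "0/1:7,50:57:55:250,0,55"),
  stringsAsFactors = FALSE
)

sample_cols <- c("sampleA", "sampleB")
field <- 3  # position within GT:AD:DP:GQ:PL, so 3 selects DP

# Return the i-th colon-separated piece of each string in x
get_field <- function(x, i) {
  vapply(strsplit(x, ":", fixed = TRUE), function(p) p[i], character(1))
}

# One column of read depths per sample
depth <- sapply(vcf[sample_cols], function(col) as.integer(get_field(col, field)))
print(depth)
```

Using vapply with character(1) makes the helper fail loudly if a split ever returns something other than one string per row, which sapply would silently tolerate.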
