5

I have this table (data1) with four columns

SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G

I need to separate columns 2-4 into two columns each, so the new output have 7 columns. Like this :

SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T C C

With the following function I could split all columns at the time but the output is not what I need.

split <- function(x){
    x <- as.character(x)
    strsplit(as.character(x), split="-")
  }

data2=apply(data1[,-1], 2, split)

data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"


$rs7730126
$rs7730126[[1]]
[1] "G" "G"


$rs6576700
$rs6576700[[1]]
[1] "C" "C"

In Stack Overflow I found a method to convert the output of strsplit to a dataframe but the rs numbers are in rows not in columns (I got a similar output with other methods in this thread strsplit by row and distribute results by column in data.frame)

> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
           t.do.call.cbind..l..
rs17054099                 T, T
rs7730126                  G, G
rs2061700                  C, C

If I do not use the function transpose (...(t(do.call...), the output is a list that I cannot write to a file.

I would like to have the solution in R to make it part of a pipeline.

I forgot to say that I need to apply this to a million columns.

Community
  • 1
  • 1
Sami
  • 53
  • 6
  • How do you expect to differentiate between columns with the same name? --hint:You can't. First, you need to define explicit column names. – N8TRO Aug 13 '15 at 21:53
  • At the end I will not use the header. I want to keep it first to be sure about the order of the SNPs. I may add rs17054099.1 and rs17054099.2. I need to create a plink ped file, so then I will add the info of each sample (FID, IDD, etc). Thank you for your interest in my question. Sami – Sami Aug 13 '15 at 21:58
  • OK, thanks for the suggestion. If the order of the columns do not change the header is not important. – Sami Aug 13 '15 at 21:59
  • [This question is already answered here](http://stackoverflow.com/questions/18154556/r-split-text-string-in-a-data-table-columns). Data.table is also a **very fast** solution for big data. – N8TRO Aug 13 '15 at 22:11
  • @DavidArenburg Good to know. Haven't tried it myself. I'll have to look at it a little more closely. I was thinking an `lapply`-`tstrsplit` combo.. Of course `cSplit` is much more straightforward. – N8TRO Aug 13 '15 at 22:20
  • @N8TRO the solution using splitstackshape worked. Thank you, Sami – Sami Aug 13 '15 at 22:20
  • @N8TRO Actually when I tried it again, it did work. Don't know why it failed in the first try, though you will need to define the column names in some smart way. I'll try to come up with something in a short. – David Arenburg Aug 13 '15 at 22:21

2 Answers2

7

This is straight forward using the splitstackshape::cSplit function. Just specify the column indices within the splitCols parameter, and the separator within to the sep parameter, and you done. It will even number your new column names so you will be able to distinguish between them. I've specified type.convert = FALSE so T values won't become TRUE. The default direction is wide, so you don't need to specify it.

library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
#        SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1           G           G            T            T           G           G

Here's a solution as per the provided link using the tstrsplit function for the devel version of data.table on GH. in here, we will define the index by subletting the column names first, and then we will number them using paste The is a bit more cumbersome approach but its advantage is that it will update your original data in place instead of create a copy of the whole data

library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
#        SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1       G-G        T-T       G-G          G          G           T           T          G          G
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Hi thank you for your answer. I tried cSplit and it did not work. I forgot to add that I have a file with nearly a million columns, so I need a way to not specify each column at the time. Thank you, – Sami Aug 13 '15 at 22:01
  • What exactly didn't work? Can you provide the a `dput` of your data set? See my edit. You can specify the column indices instead of names. So you can do something like `2:1e3`, for example. – David Arenburg Aug 13 '15 at 22:02
  • It should be my fault, I cannot specify several columns splitGeno1 = cSplit(data1, splitcols=data1[,2:4], seps="-") Error in cSplit(data2, splitcols = data2[, 2:4], seps = "-") : unused arguments (splitcols = data2[, 2:4], seps = "-") – Sami Aug 13 '15 at 22:06
  • But you are using a wrong syntax. Why `data2[, 2:4]`? This isn't what I've showed. Try `cSplit(data2, splitcols = 2:4, seps = "-", type.convert = FALSE)` – David Arenburg Aug 13 '15 at 22:13
  • 1
    I was wrong when I first tried by myself. Now I tried you command it worked perfectly, thank you very much, Sami – Sami Aug 13 '15 at 22:14
  • I thought i had seen it before. Nevermind, may be I am wrong. Anyway (+1) – akrun Aug 14 '15 at 07:32
2

Here you want to use apply over the rows instead of columns:

df <- rbind(c("SNP", "rs6576700", "rs17054099", "rs7730126"),
c("sample1", "G-G", "T-T", "G-G"),
c("sample2", "C-C", "T-T", "G-C"))

t(apply(df[-1,], 1, function(col) unlist(strsplit(col, "-"))))
#    [,1]      [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "sample1" "G"  "G"  "T"  "T"  "G"  "G" 
#[2,] "sample2" "C"  "C"  "T"  "T"  "G"  "C"  
mattdevlin
  • 1,045
  • 2
  • 10
  • 17