I have a large data.table of genotypes (260,000 rows by 1000 columns). The rows are markers and the columns are the subjects. The data looks like this:
    ID1 ID2 ID3 ID4
M1: CC  CC  TC  CC
M2: GG  GG  GG  GG
M3: TT  TT  TT  TT
M4: TG  TG  TG  TG
M5: TT  TT  TT  TT
M6: TT  TT  TT  TT
I need to split each genotype so that I have each allele in its own column like this:
    V1 V2 V3 V4 V5 V6 V7 V8
M1: C  C  C  C  T  C  C  C
M2: G  G  G  G  G  G  G  G
M3: T  T  T  T  T  T  T  T
M4: T  G  T  G  T  G  T  G
M5: T  T  T  T  T  T  T  T
M6: T  T  T  T  T  T  T  T
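To pin down the expected output, here is a minimal reproducible sketch of the six example markers. It assumes every genotype is stored as a two-character string; the name `wanted` and the use of `substr` are only for illustration, and I have not tried this at full scale.

```r
library(data.table)

# Toy version of the genotype table: rows are markers, columns are subjects.
DT <- data.table(
  ID1 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  ID2 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  ID3 = c("TC", "GG", "TT", "TG", "TT", "TT"),
  ID4 = c("CC", "GG", "TT", "TG", "TT", "TT")
)

# For each subject, take the first and second character of every genotype.
alleles <- lapply(DT, function(g) list(substr(g, 1, 1), substr(g, 2, 2)))

# Flatten one level so each allele vector becomes its own column, in order.
cols <- unlist(alleles, recursive = FALSE, use.names = FALSE)
names(cols) <- paste0("V", seq_along(cols))
wanted <- as.data.table(cols)

print(wanted)
```

The result has two columns per subject (V1 and V2 for ID1, V3 and V4 for ID2, and so on), which is the layout shown above.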
I have come up with two solutions. Both work on a subset of the data, but break down on the entire data set, either because of memory issues or because of an internal data.table error that I don't understand.
First, I used strsplit on each column and stored the results in a list, then used do.call to merge them all. I also parallelized it using the foreach function:

    ids <- colnames(DT)
    gene.split <- function(i) {
      as.data.table(do.call(rbind, strsplit(as.vector(eval(parse(text=paste("DT$",ids[i])))), split = "")))
    }
    all.gene <- foreach(i=1:length(ids)) %dopar% gene.split(i)
    do.call(cbind, all.gene)

On 4 cores this breaks down due to memory issues.
The second solution is based on a similar problem and uses the set function:

    out_names <- paste("V", 1:(2*ncol(DT)), sep="_")
    invar1 <- names(DT)
    for (i in seq_along(invar1)) {
      set(DT, i=NULL, j=out_names[2*i-1], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,1])
      set(DT, i=NULL, j=out_names[2*i], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,2])
    }

This works on a few columns, but I get the following error when I try it on the entire dataset:
Error in set(DT, i = NULL, j = out_names[2 * i - 1], value = do.call(rbind, : Internal logical error. DT passed to assign has not been allocated enough column slots. l=163, tl=163, adding 1
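My tentative reading is that the `l` and `tl` in this message are the table's current column count and the number of column slots data.table has allocated for it (its truelength). The sketch below shows how those two numbers can be inspected on a toy table, and how spare slots can be requested up front with `setalloccol` before adding columns by reference. The toy table and the figure of 50 extra slots are illustrative assumptions, and I have not confirmed that this is what triggers the error on the full data.

```r
library(data.table)

# Hypothetical toy table standing in for the real genotype data.
DT <- data.table(ID1 = c("CC", "GG"), ID2 = c("TC", "GG"))

# Columns currently in use versus column slots allocated.
print(length(DT))
print(truelength(DT))

# Ask for room for at least 50 more columns before adding any by reference.
setalloccol(DT, 50)

# With spare slots available, set() can add a new column in place.
set(DT, i = NULL, j = "V_1", value = substr(DT[["ID1"]], 1, 1))
```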
Am I going about this the wrong way?