0

I have a large data.table of genotypes (260,000 rows by 1000 columns). The rows are markers and the columns are the subjects. The data looks like this:

             ID1         ID2         ID3         ID4
M1:          CC          CC          TC          CC
M2:          GG          GG          GG          GG
M3:          TT          TT          TT          TT
M4:          TG          TG          TG          TG
M5:          TT          TT          TT          TT
M6:          TT          TT          TT          TT

I need to split each genotype so that I have each allele in its own column like this:

             V1 V2 V3 V4 V5 V6 V7 V8
        M1:  C  C  C  C  T  C  C  C
        M2:  G  G  G  G  G  G  G  G
        M3:  T  T  T  T  T  T  T  T
        M4:  T  G  T  G  T  G  T  G
        M5:  T  T  T  T  T  T  T  T
        M6:  T  T  T  T  T  T  T  T

I have come up with two solutions, both of which work on a subset of the data, but breaks down on the entire data set due to memory issues or some internal error of data.table that I dont understand.

  1. I used strsplit on each column and stored it to a list, then used do.call to merge them all. I also parallelized it using the foreach function

      ids <- colnames(DT)
      gene.split <- function(i) { 
      as.data.table(do.call(rbind,strsplit(as.vector(eval(parse(text=paste("DT$",ids[i])))), split = "")))
      }
      all.gene <- foreach(i=1:length(ids)) %dopar% gene.split(i)
      do.call(cbind,all.gene)
    

On 4 cores this breaks down due to memory issues.

  1. The second solution is based on a similar problem which uses the set function:

     out_names <- paste("V", 1:(2*ncol(DT)), sep="_")
     invar1 <- names(DT)
    
     for (i in seq_along(invar1)) {
     set(DT, i=NULL, j=out_names[2*i-1], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,1])
     set(DT, i=NULL, j=out_names[2*i], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,2])
      }
    

which works on a few columns but then I get the following error if I try using the entire dataset:

Error in set(DT, i = NULL, j = out_names[2 * i - 1], value = do.call(rbind, : Internal logical error. DT passed to assign has not been allocated enough column slots. l=163, tl=163, adding 1

Am I going about this the wrong way?

Community
  • 1
  • 1
sahir
  • 336
  • 3
  • 8
  • Do you really need to split your genotype data? What's the next step of your analysis? – MrFlick Jun 12 '14 at 01:04
  • PLINK needs the file in this format. They do have an option that allows you to have grouped alleles, but that's only if your columns are the SNPs and the rows are the subjects. – sahir Jun 12 '14 at 01:08
  • 1
    I'm not sure R is best for this. I might use something like `awk` or `perl`: `perl -lane 's/([ACTG])([ACTG])/\1 \2/g; print' geno.txt` – MrFlick Jun 12 '14 at 01:17
  • Have a look here http://stackoverflow.com/questions/11619616/how-to-split-a-string-into-substrings-of-a-given-length, perhaps you can use `sapply` on each column/row separately and glue them back together. – Brouwer Jun 12 '14 at 01:23
  • @MrFlick perhaps you're right. I was trying to avoid learning a new language (or at least enough to get the job done). Does it matter that my data is in .csv format? – sahir Jun 12 '14 at 02:19
  • @sahir Yes, it will ultimately matter what format your data is in when you implement your solution. The problem is R likes to read all the data in at a once. In other languages it's easier to process a file line-by-line taking up virtually no memory at all. But being csv doesn't make anything easier or harder for R. – MrFlick Jun 12 '14 at 02:21

4 Answers4

3

Here is an approach using data.table::set and substr (not strsplit)

Using @jbaums example data l

# coerce to `data.table` without a copy
setDT(l)
# over allocate columns so that `data.table` can assign by reference
# this will stop the error you were seeing
alloc.col(l,3000)


out_names <- paste("V", 1:(2*ncol(l)), sep="_")
invar1 <- names(l)

for (i in seq_along(invar1)) {
  set(l, i=NULL, j=out_names[2*i-1], value=substr(l[[invar1[i]]],1,1))
  set(l, i=NULL, j=out_names[2*i], value=substr(l[[invar1[i]]],2,2))
}

The final step took 37 seconds on my Windows 7 i7 2600 machine with 8GB ram

In your example you run strsplit twice (and use do.call(rbind....)) --> not efficient.

Some benchmarking of possible approaches to the splitting....

microbenchmark(substr(l[[invar1[1L]]],2,2), sapply(strsplit(l[[invar1[1L]]],''),`[`,2L),do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[,2], times=5)
Unit: milliseconds
                                                      expr        min         lq     median         uq       max neval
                             substr(l[[invar1[1L]]], 2, 2)   14.10669   14.35571   14.57485   15.78283  193.9125     5
            sapply(strsplit(l[[invar1[1L]]], ""), `[`, 2L)  345.92969 1420.03907 1944.33873 3864.82876 5371.6130     5
 do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[, 2] 3318.70878 4131.38551 4155.06126 5269.92745 8414.4948     5
thelatemail
  • 91,185
  • 12
  • 128
  • 188
mnel
  • 113,303
  • 27
  • 265
  • 254
  • Didn't know about the over allocating of columns. Im assuming I can ignore the warnings I get: ...truelength (3000) is greater than 1000 items over-allocated ... – sahir Jun 12 '14 at 04:44
  • @sahir, yes you can ignore them (as the warning suggest you can given you have set n to a large (3000) number) – mnel Jun 12 '14 at 05:36
2

Here's a relatively fast approach - took ~80 sec (after dummy data creation) (Win 8.1 x64; i4770) but chewed up ~13 GB of RAM.

# Creating initial data 
pairs <- c(outer(c('C', 'T', 'G', 'A'), c('C', 'T', 'G', 'A'), 'paste0'))
l <- replicate(1000, sample(pairs, 260000, replace=TRUE), simplify=FALSE)

system.time({
  v <- do.call(paste0, l)
  rm(l); gc()
  out <- do.call(rbind, strsplit(v, ''))
  rm(v); gc()
})

#   user  system elapsed 
#  79.07    1.24   80.33 

str(out)

# chr [1:260000, 1:2000] "A" "C" "C" "C" ...
jbaums
  • 27,115
  • 5
  • 79
  • 119
0

Here's a way to do this for a data frame x:

do.call(cbind, 
        lapply(x, 
               function(i) do.call(rbind, strsplit(as.character(i), split=''))
        )
)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "C"  "C"  "C"  "C"  "T"  "C"  "C"  "C" 
[2,] "G"  "G"  "G"  "G"  "G"  "G"  "G"  "G" 
[3,] "T"  "T"  "T"  "T"  "T"  "T"  "T"  "T" 
[4,] "T"  "G"  "T"  "G"  "T"  "G"  "T"  "G" 
[5,] "T"  "T"  "T"  "T"  "T"  "T"  "T"  "T" 
[6,] "T"  "T"  "T"  "T"  "T"  "T"  "T"  "T" 

Each column is split into characters, and then r-bound together. This gives a list of columns, which are then passed to cbind.

Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
-1
## make a small data.table for testing
dd <- data.table(ID1=c("CC","TG"),ID2=c("CC","TG"), ID3=c("TC","TG"))
dd
##    ID1 ID2 ID3
## 1:  CC  CC  TC
## 2:  TG  TG  TG

## the first base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][1])
##      ID1 ID2 ID3
## [1,] "C" "C" "T"
## [2,] "T" "T" "T"

## the second base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][2])
##      ID1 ID2 ID3
## [1,] "C" "C" "C"
## [2,] "G" "G" "G"

## These results are in matrix, if you need data.table use as.data.table to convert them back.
xb.
  • 1,617
  • 11
  • 16
  • Apply will coerce the data.table to a matrix. This is not memory efficient. – mnel Jun 12 '14 at 02:16
  • I do not agree! In general, matrix is more memory efficient for homogeneous elements. Refer to: [Data frame or matrix?](http://stackoverflow.com/questions/5158790/data-frame-or-matrix) – xb. Jun 12 '14 at 02:36
  • 1
    It may be more efficient to work with matrices, however, your example above will create an (*internal*) copy of the data.table object each time – mnel Jun 12 '14 at 02:38
  • Sure and thanks! It should be better to start with a matrix instead of data.table, unless other (column-wise) operations are considered other than those above :) – xb. Jun 12 '14 at 02:46