
I am trying to convert a matrix for use outside of R in other software (Plink). I have added all of the required information columns and now just need to split a large number of columns that currently hold values in the format "AA" into two separate columns ("A" "A"). The matrix is very large (~2.1 GB), with dimensions 109 x 2443180. I have tried several approaches and have had the most success with the following code, run in batches of around 200,000 columns, although R crashes while working on batch 8 of 13.

library(parallel)  # provides mclapply

batch1 <- do.call(cbind,
                  mclapply(snpmatrix_df[, 7:200000],
                           # split each two-character genotype into its two alleles
                           function(i) do.call(rbind, strsplit(as.character(i), split = "")),
                           mc.cores = cores))

save.image(file = "/N/u/bscomer/Karst/backup15.RData")

I do not think this is a memory issue because I am working on a computing cluster with 30 GB of available RAM.

Matrix subset provided below:

My matrix snpmatrix_df[1:2, 1:15] (note: row names and column names are included in this output but will not be in final file):

         family_id individual_id paternal_id maternal_id            phenotype GA008510 GA008524 GA008529 GA008532
xxxx_001 "IID"     "xxxx_001"    "0"         "0"         "'Female'" "2"       "00"     "00"     "00"     "00"    
xxxx_002 "IID"     "xxxx_002"    "0"         "0"         "'Female'" "2"       "00"     "00"     "00"     "00"    
         GA026677 GA026703 GA026708 GA026710 GA026711
xxxx_001 "00"     "00"     "BB"     "BB"     "BB"    
xxxx_002 "00"     "00"     "BB"     "BB"     "BB" 

Desired example final format (from software website) provided below:

FAM1    NA06985 0   0   1   1   A   T   T   T   G   G   C   C   A   T   T   

Does anybody have any suggestions on the best way to approach this issue?

  • Please add some data and a reproducible example. If your code crashes because of something strange in the data, make sure that the example data you provide triggers the same error you see. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. – dww Apr 20 '16 at 16:36
  • If the error is a memory issue, rbind may well be your culprit. It is usually recommended to allocate space first and then fill in the matrix, rather than grow it within a loop. – dww Apr 20 '16 at 16:37
  • Sadly the data is too large to provide in order to allow a reproducible example. – Brian Comer Apr 20 '16 at 16:58
  • Did you try `colsplit` from `reshape2`? `gen<-do.call(cbind,apply(df, 2, colsplit,"",c(".a1", ".a2")))`. Honestly, using another language such as Perl or Python may be preferable in this case, or you could divide the data by chromosome, apply your function, and merge later. – Ananta Apr 20 '16 at 17:02
  • I do not see a problem doing this in R, but we need some data to work with, even a subset. There should be no problem allocating a matrix of the correct size and then calculating each column in turn as either the first or second character of the original. Please edit your post to provide a working example with data. – dww Apr 20 '16 at 17:15
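
The preallocation idea in the last comment can be sketched roughly as follows. This is a minimal sketch on toy data, not something tested against the full 109 x 2443180 matrix. The names `geno` and `split_genotypes` are hypothetical, and it assumes the genotype columns are held in a character matrix in which every cell is exactly two characters long.

```r
# Hypothetical toy data standing in for the genotype columns of snpmatrix_df
geno <- matrix(c("00", "BB", "AB",
                 "00", "BB", "AA"),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("xxxx_001", "xxxx_002"),
                               c("GA008510", "GA026708", "GA026710")))

# Hypothetical helper: split each two-character genotype into two allele columns
split_genotypes <- function(geno) {
  n_snp <- ncol(geno)
  # Allocate the full result once: two allele columns per SNP
  out <- matrix(NA_character_, nrow = nrow(geno), ncol = 2L * n_snp,
                dimnames = list(rownames(geno), NULL))
  # substr() is vectorised, so it can take the first (or second) character
  # of every cell in one call instead of calling strsplit() and rbind() per column
  out[, seq(1L, by = 2L, length.out = n_snp)] <- substr(geno, 1L, 1L)
  out[, seq(2L, by = 2L, length.out = n_snp)] <- substr(geno, 2L, 2L)
  out
}

result <- split_genotypes(geno)
print(result)
```

The odd-numbered output columns hold the first allele of each SNP and the even-numbered columns the second, which matches the interleaved layout of the desired Plink example. For the real data the same function could presumably be applied to the genotype columns in batches, since the result has twice as many cells as the input.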
