I am trying to convert a matrix for use outside of R in other software (Plink). I have added all of the required information columns and now just need to split a large number of columns from their current format ("AA") into two separate columns ("A" "A"). The matrix is very large (~2.1 GB), with dimensions 109 x 2443180. I have tried multiple approaches and have had the most success with the following code, run in batches of around 200,000 columns. However, R crashes while working on batch 8 of 13.
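library(parallel)  # provides mclapply()
# cores is set earlier in my session
batch1 <- do.call(cbind,
  mclapply(snpmatrix_df[, 7:200000],
    # split each two-character genotype into single characters, one row per sample
    function(i) do.call(rbind, strsplit(as.character(i), split = "")),
    mc.cores = cores  # argument to mclapply(), not to the inner do.call()
  )
)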
save.image(file = "/N/u/bscomer/Karst/backup15.RData")
I do not think this is a memory issue because I am working on a computing cluster with 30 GB of available RAM.
A subset of my matrix, snpmatrix_df[1:2, 1:15], is shown below (note: row names and column names are included in this output but will not be in the final file):
family_id individual_id paternal_id maternal_id phenotype GA008510 GA008524 GA008529 GA008532
xxxx_001 "IID" "xxxx_001" "0" "0" "'Female'" "2" "00" "00" "00" "00"
xxxx_002 "IID" "xxxx_002" "0" "0" "'Female'" "2" "00" "00" "00" "00"
GA026677 GA026703 GA026708 GA026710 GA026711
xxxx_001 "00" "00" "BB" "BB" "BB"
xxxx_002 "00" "00" "BB" "BB" "BB"
An example of the desired final format (from the software website) is provided below:
FAM1 NA06985 0 0 1 1 A T T T G G C C A T T
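To make the target transformation concrete, here is a minimal sketch on a toy matrix. The names geno, split_genotypes and alleles are made up for illustration and are not part of the real data or script. It assumes every genotype is exactly two characters, and it uses substr() in place of the strsplit() approach above purely to show the desired output layout (two adjacent allele columns per SNP).

# Toy stand-in for the genotype columns (hypothetical data: 2 samples x 3 SNPs)
geno <- matrix(c("AA", "AB",
                 "00", "BB",
                 "BB", "BB"),
               nrow = 2,
               dimnames = list(c("s1", "s2"), c("snp1", "snp2", "snp3")))

# Split each two-character genotype into its two alleles.
# substr() is vectorised, so it handles the whole matrix in one call.
split_genotypes <- function(g) {
  out <- matrix(NA_character_, nrow = nrow(g), ncol = 2L * ncol(g))
  out[, seq(1L, ncol(out), by = 2L)] <- substr(g, 1L, 1L)  # first allele of each SNP
  out[, seq(2L, ncol(out), by = 2L)] <- substr(g, 2L, 2L)  # second allele of each SNP
  rownames(out) <- rownames(g)
  out
}

alleles <- split_genotypes(geno)
print(alleles)

Each input column becomes two adjacent output columns, which matches the allele layout in the example line above.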
Does anybody have any suggestions on the best way to approach this issue?