I am trying to read a large CSV file into a data.table, split it into 64 chunks based on the field 'sample_name', and apply a function 'myfunction' to each chunk in parallel.
library(data.table)
library(plyr)
library(doMC)
registerDoMC(5) #register the parallel backend with 5 cores
#read large csv file with 6485845 rows, 13 columns
dt = fread('~/data/samples.csv')
#example subset of dt (I am showing only 3 columns)
#sample_name snpprobeset_id snp_strand
#C00060 exm1002141 +
#C00060 exm1002260 -
#C00060 exm1002276 +
#C00075 exm1002434 -
#C00075 exm1002585 -
#C00150 exm1002721 -
#C00150 exm1004566 -
#C00154 exm100481 +
#C00154 exm1004821 -
#split into 64 chunks based on column 'sample_name'.
#each chunk is passed as an argument to a function 'myfunction'
ddply(dt,.(sample_name),myfunction,.parallel=TRUE)
#function definition
myfunction <- function(arg1)
{
#arg1 <- data.table(arg1)
#write columns 9,11,12 to a tab-delimited bed file named '<sample_name>.bed', e.g. C00060.bed, C00075.bed and so on; 64 bed files would be written out, one per chunk.
write.table(arg1[, c(9, 11, 12)], paste0("~/Desktop/", unique(arg1$sample_name), ".bed"), row.names = FALSE, quote = FALSE, sep = "\t", col.names = FALSE)
#execute a system command for bam-readcount (bioinformatics program)
#build command
p1 <- paste(unique(arg1$sample_name),".bed",sep="")
p2 <- paste("bam-readcount -b 20 -f hg19.fa -l",p1,sep=" ")
p3 <- paste(unique(arg1$sample_name),".bam",sep="")
p4 <- paste(p2,p3,sep=" ")
p5 <- paste(unique(arg1$sample_name),"_output.txt",sep="")
p6 <- paste(p4,p5,sep=" > ")
system(p6) #execute system command
#executes something like this, for sample_name=C00060
#bam-readcount -b 20 -f hg19.fa -l C00060.bed C00060.bam > C00060_output.txt
#read back in C00060_output.txt file
#manipulate the file (multiple steps)
#write output to another file
}
Here, when I split my data.table 'dt' by 'sample_name' using ddply(), the chunks come back as data.frames, not data.tables. So I am thinking of converting each data.frame into a data.table as soon as it is passed to the function (the first line of the function definition) and then doing the rest of the processing on the data.table. Is there a better, more efficient alternative to this?
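To make it concrete, the conversion approach would look roughly like this (just a sketch; I am assuming as.data.table()/setDT() from data.table and with=FALSE for the numeric column selection, with the rest of the function body unchanged):
#rough sketch: convert the data.frame chunk handed over by ddply() back
#into a data.table on the first line, then continue with data.table syntax
myfunction <- function(arg1)
{
  arg1 <- as.data.table(arg1) #or setDT(arg1) to convert by reference
  sample <- unique(arg1$sample_name)
  #selecting columns by number on a data.table needs with=FALSE
  write.table(arg1[, c(9, 11, 12), with = FALSE],
              paste0("~/Desktop/", sample, ".bed"),
              row.names = FALSE, quote = FALSE, sep = "\t", col.names = FALSE)
  #...bam-readcount call and the rest of the processing as above...
}
setDT() would convert the chunk in place instead of copying it, which seems preferable for large chunks, but I am not sure this whole approach is the right way to go.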