I am trying to read a large csv file as a data.table, split it into 64 chunks based on a field 'sample_name', and apply a function 'myfunction' to each chunk in parallel.

library(data.table)
library(plyr)
library(doMC)

registerDoMC(5) #assign 5 cores

#read large csv file with 6485845 rows, 13 columns
dt = fread('~/data/samples.csv')

#example subset of dt (I am showing only 3 columns)
#sample_name   snpprobeset_id   snp_strand
#C00060        exm1002141       +
#C00060        exm1002260       -
#C00060        exm1002276       +
#C00075        exm1002434       -
#C00075        exm1002585       -
#C00150        exm1002721       -
#C00150        exm1004566       -
#C00154        exm100481        +
#C00154        exm1004821       -

#split into 64 chunks based on column 'sample_name';
#each chunk is passed as an argument to 'myfunction' (defined below)
ddply(dt,.(sample_name),myfunction,.parallel=TRUE)

#function definition
myfunction <- function(arg1)
{
    #arg1 <- data.table(arg1)
    sname <- unique(arg1$sample_name)
    #write columns 9, 11, 12 to a tab-delimited bed file named '<sample_name>.bed',
    #e.g. C00060.bed, C00075.bed and so on -- 64 bed files for the 64 chunks
    bedfile <- paste("~/Desktop/",sname,".bed",sep="")
    write.table(arg1[,c(9,11,12)],bedfile,row.names=F,quote=F,sep="\t",col.names=F)
    #build and execute a system command for bam-readcount (a bioinformatics program),
    #pointing it at the bed file just written
    cmd <- paste("bam-readcount -b 20 -f hg19.fa -l",bedfile,
                 paste(sname,".bam",sep=""),">",
                 paste(sname,"_output.txt",sep=""))
    system(cmd) #execute system command
    #for sample_name=C00060 this executes something like:
    #bam-readcount -b 20 -f hg19.fa -l ~/Desktop/C00060.bed C00060.bam > C00060_output.txt
    #read C00060_output.txt back in
    #manipulate the file (multiple steps)
    #write the output to another file
}

Here, when I split my data.table 'dt' based on 'sample_name' using ddply(), the chunks arrive as data.frames, not data.tables. So I am thinking of converting each data.frame back into a data.table once it is passed to the function (the commented-out first line of the function definition) and then doing the rest of the processing on the data.table. Is there a better, more efficient alternative to this?
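
For clarity, here is a minimal sketch of the conversion I have in mind (not my full function), assuming the chunk arrives from ddply() as a plain data.frame and that my data.table version has setDT(); with an older version, as.data.table(arg1) would do the same thing but with a copy:

#minimal sketch of the conversion idea
myfunction <- function(arg1)
{
    setDT(arg1)                            #convert the data.frame chunk to a data.table by reference
    sname <- unique(arg1$sample_name)
    bed <- arg1[, c(9,11,12), with=FALSE]  #numeric column positions need with=FALSE on a data.table
    write.table(bed, paste("~/Desktop/",sname,".bed",sep=""),
                row.names=F, quote=F, sep="\t", col.names=F)
    #...rest of the processing as above...
}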

Komal Rathi
  • Almost surely yes, but it's impossible to say exactly what that would be without knowing precisely what processing you need to do. – joran Apr 03 '14 at 16:59
  • Ok, I will add a few details and update my question. – Komal Rathi Apr 03 '14 at 17:11
  • http://stackoverflow.com/questions/11562656/averaging-column-values-for-specific-sections-of-data-corresponding-to-other-col/11562850#11562850 – Ari B. Friedman Apr 03 '14 at 17:20
  • I used dt[,myfunction,by=sample_name] but it shows me an error: Error in `[.data.table`(dt, , myfunction, by = sample_name) : invalid type/length (closure/64) in vector allocation – Komal Rathi Apr 03 '14 at 17:45
  • plyr is obsolete since 2014, use dplyr instead. plyr chokes on splits of high cardinality, since it tries to allocate all the subdataframes upfront regardless of whether that'll blow out memory. – smci Mar 03 '15 at 10:32
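
Regarding the dt[,myfunction,by=sample_name] error in the comments above: that form passes the function itself (a closure) into j for each of the 64 groups instead of calling it, which is what produces the closure/64 message. A minimal sketch of what I think the call would have to look like, assuming a hypothetical variant of myfunction that takes the sample name as a second argument (because .SD excludes the grouping column):

#hypothetical: myfunction rewritten as function(arg1, sname), using 'sname'
#in place of unique(arg1$sample_name); .SD is the current chunk without the
#grouping column and .BY$sample_name is that chunk's sample name
dt[, myfunction(.SD, .BY$sample_name), by = sample_name]
#note: unlike registerDoMC() + ddply(..., .parallel=TRUE), this runs the
#64 groups sequentially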

0 Answers