The script below reads a large tab-delimited file into a data frame in memory and then fits a regression model on each of the file's columns. I originally wrote it for a smaller file; when I rerun it on a larger one (1.3 GB), it fails to load the file but also does not issue any error message.
In this situation it should be possible to read the file one column at a time, so that almost nothing needs to be held in memory at once (the file has ~600 rows but ~1,000,000 columns).
However, I am very new to R and unsure how to implement this. I could split the file into chunks and run the script on each chunk (roughly the workaround sketched at the end of this post), but I would rather learn a better alternative to read.table() that would let me process even larger files without holding everything in memory.
What are the best alternatives in this case?
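To make the question concrete, this is roughly the column-at-a-time read I have in mind. It is only a sketch assuming data.table's fread (whose select argument limits which columns are read), not something I have working:

library(data.table)

## Read the header row only, to get the column names without loading any data
cols <- names(fread("aa_ra_gwas_erosions_fornb.raw", nrows = 0))

## Then pull in a single column by name
one_col <- fread("aa_ra_gwas_erosions_fornb.raw", select = cols[7])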
library(MASS)   # glm.nb
library(pscl)   # zeroinfl

ICHP <- read.table("aa_ra_gwas_erosions_fornb.raw", header = TRUE)
covfile <- read.table("gwas_erosion_sample_list_covs.txt", header = TRUE)

fhandle <- file("ichip_nb_model.csv", "a")
fhandle2 <- file("ichip_nb_model_LLR.csv", "a")

## Covariate-only null model
nullglmmod <- zeroinfl(OverllTot0 ~ sex + pc1 + cohort + ra + DisDurMonths + smoke,
                       data = covfile, dist = "negbin")

## One model per marker column; columns 1-6 are sample identifiers/covariates
for (i in seq(7, ncol(ICHP), 1)) {
  writeLines(colnames(ICHP)[i], con = fhandle)
  writeLines(colnames(ICHP)[i], con = fhandle2)
  marker <- ICHP[[i]]   # extract the i-th column directly instead of eval(parse(...))
  glmmod <- glm.nb(OverllTot0 ~ sex + pc1 + cohort + ra + DisDurMonths + smoke + marker,
                   data = covfile)
  anovaresults <- anova(glmmod, nullglmmod)
  summ <- coef(summary(glmmod))
  rownames(summ)[8] <- paste0("ICHP$", colnames(ICHP)[i])
  write.table(anovaresults, file = fhandle2)
  write.table(round(summ, 4), file = fhandle)
}

close(fhandle)
close(fhandle2)
It is the first file, the one read into ICHP, that is very large.
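For completeness, this is roughly the chunked workaround I mentioned above, again only a sketch assuming data.table::fread and an arbitrary block size; I would prefer an approach that avoids managing blocks by hand like this:

library(data.table)

infile <- "aa_ra_gwas_erosions_fornb.raw"
all_cols <- names(fread(infile, nrows = 0))   # header only, no data rows
block_size <- 1000

## Process the marker columns (7 onwards) in blocks of 1000
for (start in seq(7, length(all_cols), by = block_size)) {
  idx <- start:min(start + block_size - 1, length(all_cols))
  block <- fread(infile, select = all_cols[idx])
  ## ... fit glm.nb for each column of 'block' as in the loop above ...
}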