
The script below loads a large TSV file into memory as a data frame, then runs a linear model on each column of the file. I wrote the script for a smaller file, but when I tried to rerun it on a larger one (1.3 GB), it fails to load the file yet issues no error message.

In general, in this type of circumstance one should be able to read one column at a time from the file, in which case almost nothing would be loaded into memory (this file has ~600 rows but ~1,000,000 columns).

However, I am very new to R and unsure how to implement such a solution. I could break the file into chunks and run the script on each chunk, but I would rather learn a better alternative to read.table() that would let me process potentially even larger files without holding everything in memory.

What are the best alternatives in this case?

library(pscl)  # provides zeroinfl()
library(MASS)  # provides glm.nb()

ICHP <- read.table("aa_ra_gwas_erosions_fornb.raw", header=TRUE)
covfiledt <- read.table("gwas_erosion_sample_list_covs.txt", header=TRUE)
fhandle <- file("ichip_nb_model.csv", "a")
fhandle2 <- file("ichip_nb_model_LLR.csv", "a")

# Null model: covariates only, no genotype term
nullglmmod <- zeroinfl(formula=OverllTot0 ~ sex + pc1 + cohort + ra + DisDurMonths + smoke, data=covfiledt, dist="negbin")

# Columns 1-6 hold identifiers; test each remaining column as a predictor
for (i in seq(7, ncol(ICHP))) {
    writeLines(colnames(ICHP)[i], con=fhandle)
    writeLines(colnames(ICHP)[i], con=fhandle2)
    snp <- ICHP[[i]]  # direct column extraction; no need for eval(parse(...))
    glmmod <- glm.nb(OverllTot0 ~ sex + pc1 + cohort + ra + DisDurMonths + smoke + snp, data=covfiledt)
    anovaresults <- anova(glmmod, nullglmmod)
    summ <- coef(summary(glmmod))
    rownames(summ)[8] <- paste0("ICHP$", colnames(ICHP)[i])  # label the genotype coefficient row
    write.table(anovaresults, file=fhandle2)
    write.table(round(summ, 4), file=fhandle)
}
close(fhandle)
close(fhandle2)

It is the first file, the one read into ICHP, that is the very large one.

– Vincent Laufer
  • Use `fread` from the `data.table` package (see the sketch below). See also: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – Hugh Feb 02 '15 at 04:47
  • For reading columns, you might get away with using `sqldf`. – Roman Luštrik Feb 02 '15 at 10:08
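
For reference, a minimal sketch of the `fread` approach from the first comment. This is an assumption-laden illustration, not code from the question: it assumes `fread` can auto-detect the file's separator and that a recent `data.table` is installed; the `select` argument stores only the requested columns in the result.

library(data.table)

# Read only columns 2 and 4 (by position; names work too) from a wide file.
# Columns not listed in `select` are skipped rather than materialised.
dt <- fread("aa_ra_gwas_erosions_fornb.raw", select = c(2, 4))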

1 Answer


If you want to stay in base R, are not that concerned about speed, and simply want to limit the memory requirement by reading just certain columns of a TSV, you can still use read.table and set the colClasses argument to "NULL" for the columns you don't want to read.

For example, if you have 1000 columns and want to read just columns 2 and 4, set everything else to "NULL":

cols <- rep("NULL", 1000)  # "NULL" tells read.table to skip the column entirely
cols[c(2, 4)] <- NA        # NA lets read.table infer the column's type
read.table("file", colClasses=cols)
– Ricky
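
Applying the same idea to the question's file, here is a hypothetical sketch that reads one predictor column per pass, so only ~600 values are parsed into the result at a time. The path and the column offset come from the question; variable names are illustrative.

# Read just the header (plus one row) to learn the column count
path <- "aa_ra_gwas_erosions_fornb.raw"
hdr  <- read.table(path, header=TRUE, nrows=1)
n    <- ncol(hdr)

for (i in seq(7, n)) {          # columns 1-6 are identifiers
    cols    <- rep("NULL", n)   # skip every column...
    cols[i] <- NA               # ...except column i (NA = guess its type)
    onecol  <- read.table(path, header=TRUE, colClasses=cols)
    # onecol is a one-column data frame; fit glm.nb against it here
}

One caveat: each pass re-parses the entire file, so with ~1,000,000 columns this is slow. Selecting a block of columns per pass, or using `fread`'s `select` as suggested in the comments, scans the file far fewer times.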