R: Only read files that have a value > 0 anywhere in a particular column (read conditioned on value in column)

Question

I have millions of files but some have all 0s in a particular column. I don't want to read those in.

So what I've done is read in only that particular column from all the files to get the names of the files that have >0 in that column and exclude the rest.

This seems very time-consuming/inefficient. Is there a way around it?

1. List all files

library(data.table)  # fread(), rbindlist()
library(magrittr)    # %>%
list_of_files<-list.files(pattern="someName_")

2. Select only one column

Check header names for location of column

temp<-read.table(list_of_files[1], sep="," , header=TRUE, nrows = 1)

selectCols = c("X")
selectCols = which(names(temp) %in% selectCols)

3. Read in only selectCols column from all files

tempData <-rbindlist(sapply(list_of_files, fread, select = selectCols, simplify = FALSE),
                          use.names = TRUE, idcol = "FileName")

4. Check which files have X>0

list_of_filesWithInfectiousPassenger<-
tempData$FileName[which(tempData$X > 0)] %>%
     unique()

5. Read only those files with a value greater than 0 in X.

tempData <-rbindlist(sapply(list_of_filesWithInfectiousPassenger, fread, simplify = FALSE),
                          use.names = TRUE, idcol = "FileName")
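
One possible way around the double read, sketched under the assumption that each file is small enough that reading it in full costs about the same as reading a single column: read every file once and drop it immediately if X has no value above 0. The helper names `read_if_positive` and `read_positive_files` are made up for this illustration, and whether this is actually faster would need to be benchmarked on the real files.

```r
library(data.table)

# Hypothetical helper: read a file once and return it only if column `col`
# contains at least one value greater than 0; otherwise return NULL.
read_if_positive <- function(path, col = "X") {
  dt <- fread(path)
  if (any(dt[[col]] > 0, na.rm = TRUE)) dt else NULL
}

# Hypothetical helper: apply read_if_positive() to every file, discard the
# NULL results, and stack the remaining tables with the file name as an id.
read_positive_files <- function(files, col = "X") {
  kept <- lapply(setNames(files, files), read_if_positive, col = col)
  kept <- Filter(Negate(is.null), kept)
  rbindlist(kept, use.names = TRUE, idcol = "FileName")
}

tempData <- read_positive_files(list.files(pattern = "someName_"))
```

Each file is opened only once this way, and the intermediate table holding column X for every file is never built.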

Are you on Linux? Maybe something like this + fread()? https://stackoverflow.com/questions/19602181/how-to-extract-one-column-of-a-csv-file — s_baldur, Dec 02 '20 at 11:23
Agree with @sindri_baldur. If you use the `fread()` approach, be sure to specify the column as numeric (i.e., `colClasses = `) and maybe `header = ` as well, just to reduce the guessing that would otherwise be repeated for every file. You can also do some quick benchmarks against `vroom` and `readr` (i.e., `vroom()` and `read_csv()`). I'd be surprised if they outperformed `fread()` but it is worth a check depending on the size of your files. Also, if you have control over it, make sure the files are somewhere local / quick for your computer to access. — Andrew, Dec 02 '20 at 11:39
Thank you both. I'm trying to keep everything inside R but awk is a very good idea. Good tip about the colClasses (I'll use that for another part too). All my files are really small so vroom is actually quite a lot slower than fread which surprised me. — HCAI, Dec 02 '20 at 12:43
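
The `colClasses` / `header` tip from the comments could look roughly like the sketch below. It assumes the column is called X and every file has a header row, as in the question. `has_positive` is a hypothetical helper name, not something from data.table.

```r
library(data.table)

# Hypothetical helper: TRUE if column `col` of the file at `path` has any
# value greater than 0. Fixing header and colClasses avoids type and header
# detection being repeated for every file.
has_positive <- function(path, col = "X") {
  x <- fread(path, select = col, header = TRUE,
             colClasses = list(numeric = col))[[1]]
  any(x > 0, na.rm = TRUE)
}

list_of_files <- list.files(pattern = "someName_")
keep <- vapply(list_of_files, has_positive, logical(1))
files_with_positive <- list_of_files[keep]
```

This returns one logical per file instead of binding all the X columns together, so the check never holds more than one file's column in memory at a time.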