I have millions of files, but in some of them a particular column contains only 0s. I don't want to read those in.
So what I've done is read only that column from every file, get the names of the files that have a value >0 in it, and exclude the rest.
This seems very time-consuming and inefficient. Is there a way around it?
1. List all files
library(data.table)  # fread, rbindlist
library(magrittr)    # %>%
list_of_files <- list.files(pattern = "someName_")
2. Select only one column
Check header names for location of column
temp<-read.table(list_of_files[1], sep="," , header=TRUE, nrows = 1)
selectCols = c("X")
selectCols = which(names(temp) %in% selectCols)
3. Read in only selectCols column from all files
tempData <-rbindlist(sapply(list_of_files, fread, select = selectCols, simplify = FALSE),
use.names = TRUE, idcol = "FileName")
4. Check which files have X>0
# The id column created above is named "FileName" (see idcol), not "file_name"
list_of_filesWithInfectiousPassenger <-
tempData$FileName[which(tempData$X > 0)] %>%
unique()
5. Read only those files with a value greater than 0 in X.
tempData <-rbindlist(sapply(list_of_filesWithInfectiousPassenger, fread, simplify = FALSE),
use.names = TRUE, idcol = "FileName")
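
For reference, the steps above can be condensed into a small self-contained sketch. It assumes the column is named X in every file and that the files are comma-separated; the helper name `files_with_positive` and the toy file contents are made up for illustration. It still reads the X column from every file, so it shows the same two-pass workflow in a reproducible form rather than a faster alternative. Note that `fread`'s `select` argument accepts column names directly, so the header lookup in step 2 may not be needed.

```r
library(data.table)

# Hypothetical helper: return the files whose column `col` has at least one value > 0.
files_with_positive <- function(files, col = "X") {
  keep <- vapply(files, function(f) {
    # Read only the one column of interest from this file.
    x <- fread(f, select = col)[[1]]
    any(x > 0, na.rm = TRUE)
  }, logical(1))
  files[keep]
}

# Toy data in a temporary directory (file contents are invented for the example).
tmp <- tempfile("demo_")
dir.create(tmp)
fwrite(data.table(X = c(0, 0), Y = 1:2), file.path(tmp, "someName_1.csv"))
fwrite(data.table(X = c(0, 3), Y = 3:4), file.path(tmp, "someName_2.csv"))

list_of_files <- list.files(tmp, pattern = "someName_", full.names = TRUE)
keep <- files_with_positive(list_of_files)

# Read the selected files in full, tagging each row with its source file.
tempData <- rbindlist(sapply(keep, fread, simplify = FALSE),
                      use.names = TRUE, idcol = "FileName")
```

Checking each file separately with `any()` avoids building one large table of X values across all files just to extract the file names.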