This is a continuation of this previous Stack Overflow question:
Fastest way to read in 100,000 .dat.gz files
I have many .dat.gz files, and a large share of the rows in them contain zero values that I want to avoid bringing into memory.
Create data for test case:
# Make a directory to hold the practice files
system("mkdir practice")
require(data.table)
# Function to create one tab-separated data file and gzip it
create_write_data <- function(file.nm) {
  dt <- data.table(Day = 0:365)
  # 17 sparse count columns (Poisson with lambda = 0.1, so mostly zeros)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rpois(n = 366, 0.1))]
  write.table(dt, paste0("./practice/", file.nm), row.names = FALSE, sep = "\t", quote = FALSE)
  system(paste0("gzip ./practice/", file.nm))
}
And here is the code that applies it:
# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))
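For reference, each of these gzipped files is a tab-separated table with a Day column followed by V1 through V17, i.e. 18 columns in total. A quick way to peek at one of them (assuming a Unix shell, since fread is handed a gunzip pipeline here, the same trick as the fread calls below):

# Decompress one file on the fly and inspect its structure:
# 18 tab-separated columns named Day, V1, ..., V17
chk <- fread("gunzip -c ./practice/dt1.dat.gz")
str(chk)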
My solution (not working)
The answer to the linked Stack Overflow question gave this great one-liner for reading in all the data at once:
tbl = fread('cat ./practice/*dat.gz | gunzip | grep -v "^Day"')
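This works for me (assuming a Unix shell where cat, gunzip and grep are available): the grep -v "^Day" drops the repeated header lines, leaving a single table of 3,660 rows (366 rows x 10 files) and 18 columns, where column 1 holds Day and columns 14 and 15 hold the original V13 and V14:

dim(tbl)
# [1] 3660   18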
But now I want to keep only the rows where columns 14 and 15 are both non-zero, so I created the following pipeline to feed to fread, using an awk command:
command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/ && $14 !=0 && $15 != 0'"
fread(command)
However, this is not filtering my data at all. Any ideas on how to get the awk command to work within this workflow?