
This is a continuation from this previous Stack Overflow question:

Fastest way to read in 100,000 .dat.gz files

I have many .dat.gz files, and many of the rows in these files contain zero values that I want to avoid bringing into memory.

Create data for test case:

# Make dir
system("mkdir practice")
require(data.table)

# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day=0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rpois(n=366, 0.1))]
  write.table(dt, paste0("./practice/",file.nm), row.names=FALSE, sep="\t", quote=FALSE)
  system(paste0("gzip ./practice/", file.nm))    
}

And here is the code applying that function:

# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))
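As a quick sanity check (not part of the original post), something like the following should confirm that the ten gzipped test files were actually created and roughly match the 550 kb on disk mentioned above:

# Sanity check: list the gzipped test files and total their size on disk
files <- list.files("practice", pattern = "\\.dat\\.gz$", full.names = TRUE)
length(files)          # expected: 10
sum(file.size(files))  # expected: roughly 550 kb in total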

My solution (not working)

The answer to the Stack Overflow question linked above gave this great one-liner for reading in all the data at once:

tbl = fread('cat ./practice/*dat.gz | gunzip | grep -v "^Day"')
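For the test data generated above, a quick way to verify that this combined read works is to check the dimensions of the result (a sketch, not part of the original question): with 10 files of 366 data rows each and 18 columns (Day plus V1 to V17), the result should be 3660 x 18.

# Read everything, dropping the repeated header lines, then check dimensions
tbl <- fread('cat ./practice/*dat.gz | gunzip | grep -v "^Day"')
dim(tbl)   # expected: 3660 rows (10 files x 366 days) and 18 columns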

But now I want to filter the data so that only rows where columns 14 and 15 are both non-zero are kept, so I created the following pipeline, using an awk command, to feed to fread:

command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/ && $14 !=0 && $15 != 0'"
fread(command)

However, this is not filtering my data at all. Any ideas on how to get the awk command to work within this workflow?
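For reference, one way to sanity-check the awk filter outside of fread is to run it against a single decompressed file and count the surviving rows. This is only a diagnostic sketch (dt1.dat.gz is just the first test file generated above); note that write.table above used sep="\t", so the fields are tab-separated rather than comma-separated.

# Diagnostic only: count rows that survive the filter for one file
# (awk's default whitespace field splitting handles the tab-delimited data)
system("gunzip -c ./practice/dt1.dat.gz | awk '!/^Day/ && $14 != 0 && $15 != 0' | wc -l")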

Comments
  • change the `grep` in cmd-1 to be the awk in cmd-2, like `...| awk -F, '/^Day/ && $14 !=0 && $15 != 0'"` (essentially eliminating cmd-2). Good luck. – shellter Apr 17 '16 at 18:44
  • Good suggestion! It makes sense to consolidate that step. I modified my question with your suggestion, but the filtering step with variables 14 and 15 still isn't working for some reason. – Mike.Gahan Apr 17 '16 at 18:55
  • OK, but my point was to eliminate cmd-2. Why do you need it? If cmd-1 was working with the `grep -v`, just replace that with the modified awk code. At this point, I would set up a small test to confirm that `$14 && $15` contain the values you think they do. Good luck. – shellter Apr 17 '16 at 19:26
  • Gotcha now. I edited my question. I just checked again, and columns 14 and 15 definitely have zeros. Strange. – Mike.Gahan Apr 17 '16 at 19:30
  • Ah, just noticed something else: I think you want `...| gunzip -c | awk...` (can't remember now, might even need `...| gunzip -c - | awk ...`). Good luck. – shellter Apr 17 '16 at 23:29
  • Hmmm, still doesn't seem to work. I really appreciate all the help you are providing. – Mike.Gahan Apr 18 '16 at 01:12
  • OK, it seems to work with the following: `command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/' | awk '$14 != 0 || $15 != 0'"`. Is this taking 2 passes at the data? It seems like it might slow things down over many, many files, but it does seem to work. – Mike.Gahan Apr 18 '16 at 01:50
  • No, this isn't 2 passes on the data. It's pretty efficient. But I missed one other minor optimization before: you can further simplify to `gunzip -c ./path/to/files*.dat.gz | awk ...`. Good luck. – shellter Apr 18 '16 at 02:36
  • @Mike.Gahan you may want to run this command at least once to ensure that you can assume that there aren't any embedded tabs in the tab-delimited data: `system2('cat', c('practice/*.dat.gz | gunzip | awk "{print NF}" | sort | uniq -c'))` – Clayton Stanley Apr 19 '16 at 00:11

1 Answer


This question received an answer in the comments.

OK, it seems to work with the following:
command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/' | awk '$14 != 0 || $15 != 0'"
Is this taking 2 passes at the data? It seems like it might slow things down over many, many files, but it does seem to work.

No, this isn't 2 passes on the data. It's pretty efficient. But I missed one other minor optimization before: you can further simplify to gunzip -c ./path/to/files*.dat.gz | awk ...
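Putting the comment thread together, a minimal consolidated sketch of the whole call might look like the following. It assumes the tab-delimited test data generated above (so awk's default whitespace field splitting is sufficient and no -F, is needed) and the filter that ended up working in the comments, i.e. keep rows where column 14 or column 15 is non-zero:

require(data.table)

# Decompress, drop header lines, and filter in a single pass, then stream into fread()
command <- "gunzip -c ./practice/*dat.gz | awk '!/^Day/ && ($14 != 0 || $15 != 0)'"
tbl <- fread(command)

In more recent versions of data.table, passing the pipeline via the cmd argument, fread(cmd = command), makes it explicit that the string is a shell command rather than a file name.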

– Niall Cosgrove