
This is a continuation from this previous Stack Overflow question:

Fastest way to read in 100,000 .dat.gz files

I have many .dat.gz files, and many of the rows in these files contain zero values that I want to avoid bringing into memory.

Create data for test case:

# Make dir
system("mkdir practice")
require(data.table)

# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day=0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rpois(n=366, 0.1))]
  write.table(dt, paste0("./practice/",file.nm), row.names=FALSE, sep="\t", quote=FALSE)
  system(paste0("gzip ./practice/", file.nm))    
}

And here is the code applying that function:

# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))
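As a quick sanity check (not part of the original post), something like the following should confirm that the ten gzipped test files were actually created and roughly match the 550 kb on disk mentioned above:

# Sanity check: list the gzipped test files and total their size on disk
files <- list.files("practice", pattern = "\\.dat\\.gz$", full.names = TRUE)
length(files)          # expected: 10
sum(file.size(files))  # expected: roughly 550 kb in total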

My solution (not working)

The answer to the Stack Overflow question linked above gave this great one-liner for reading in all the data at once:

tbl = fread('cat ./practice/*dat.gz | gunzip | grep -v "^Day"')
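For the test data generated above, a quick way to verify that this combined read works is to check the dimensions of the result (a sketch, not part of the original question): with 10 files of 366 data rows each and 18 columns (Day plus V1 to V17), the result should be 3660 x 18.

# Read everything, dropping the repeated header lines, then check dimensions
tbl <- fread('cat ./practice/*dat.gz | gunzip | grep -v "^Day"')
dim(tbl)   # expected: 3660 rows (10 files x 366 days) and 18 columns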

But now I want to filter the data so that only rows where columns 14 and 15 are both non-zero are kept, so I created the following pipeline, using an awk command, to feed to fread:

command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/ && $14 !=0 && $15 != 0'"
fread(command)

However, this is not filtering my data at all. Any ideas on how to get the awk command to work within this workflow?
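For reference, one way to sanity-check the awk filter outside of fread is to run it against a single decompressed file and count the surviving rows. This is only a diagnostic sketch (dt1.dat.gz is just the first test file generated above); note that write.table above used sep="\t", so the fields are tab-separated rather than comma-separated.

# Diagnostic only: count rows that survive the filter for one file
# (awk's default whitespace field splitting handles the tab-delimited data)
system("gunzip -c ./practice/dt1.dat.gz | awk '!/^Day/ && $14 != 0 && $15 != 0' | wc -l")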

Comments
  • change the `grep` in cmd-1 to be the awk in cmd-2, like `...| awk -F, '/^Day/ && $14 !=0 && $15 != 0'"` (essentially eliminating cmd-2). Good luck. – shellter Apr 17 '16 at 18:44
  • Good suggestion! It makes sense to consolidate that step. I modified my question with your suggestion, but the filtering step with variables 14 and 15 still isn't working for some reason. – Mike.Gahan Apr 17 '16 at 18:55
  • OK, but my point was to eliminate cmd-2. Why do you need it? If cmd-1 was working with the `grep -v`, just replace that with the modified awk code. At this point, I would set up a small test to confirm that `$14 && $15` contain the values you think they do. Good luck. – shellter Apr 17 '16 at 19:26
  • Gotcha now. I edited my question. I just checked again, and columns 14 and 15 definitely have zeros. Strange. – Mike.Gahan Apr 17 '16 at 19:30
  • Ah, just noticed something else: I think you want `...| gunzip -c | awk...` (can't remember now, might even need `...| gunzip -c - | awk ...`). Good luck. – shellter Apr 17 '16 at 23:29
  • Hmmm, still doesn't seem to work. I really appreciate all the help you are providing. – Mike.Gahan Apr 18 '16 at 01:12
  • OK, it seems to work with the following: `command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/' | awk '$14 != 0 || $15 != 0'"`. Is this taking 2 passes at the data? It seems like it might slow things down over many, many files, but it does seem to work. – Mike.Gahan Apr 18 '16 at 01:50
  • No, this isn't 2 passes on the data. It's pretty efficient. But I missed one other minor optimization before: you can further simplify to `gunzip -c ./path/to/files*.dat.gz | awk ...`. Good luck. – shellter Apr 18 '16 at 02:36
  • @Mike.Gahan you may want to run this command at least once to ensure that you can assume that there aren't any embedded tabs in the tab-delimited data: `system2('cat', c('practice/*.dat.gz | gunzip | awk "{print NF}" | sort | uniq -c'))` – Clayton Stanley Apr 19 '16 at 00:11

1 Answer


This question received an answer in the comments.

OK, it seems to work with the following:
command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/' | awk '$14 != 0 || $15 != 0'"
Is this taking 2 passes at the data? It seems like it might slow things down over many, many files, but it does seem to work.

No, this isn't 2 passes on the data. It's pretty efficient. But I missed one other minor optimization before: you can further simplify to gunzip -c ./path/to/files*.dat.gz | awk ...
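Putting the comment thread together, a minimal consolidated sketch of the whole call might look like the following. It assumes the tab-delimited test data generated above (so awk's default whitespace field splitting is sufficient and no -F, is needed) and the filter that ended up working in the comments, i.e. keep rows where column 14 or column 15 is non-zero:

require(data.table)

# Decompress, drop header lines, and filter in a single pass, then stream into fread()
command <- "gunzip -c ./practice/*dat.gz | awk '!/^Day/ && ($14 != 0 || $15 != 0)'"
tbl <- fread(command)

In more recent versions of data.table, passing the pipeline via the cmd argument, fread(cmd = command), makes it explicit that the string is a shell command rather than a file name.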

– Niall Cosgrove