
I have a gz file that has several columns with headers. The first two columns look something like this:

some header0   some header1
10:100000625   a
10:100000645   b
10:100002464   c
10:100003242   d
10:100003785   e
10:100004360   f

I also have a txt file that contains some of the entries from the first file's first column (no header), e.g.:

 10:100002464
 10:100004360

I want to create a new gz file that contains only the entries found in the txt file and keeps the headers:

some header0   some header1
10:100002464   c
10:100004360   f

The command I tried outputs a gz file with no headers. How can I keep them?

zcat my_file.gz | grep -Fw -f my_other_file.txt | gzip > my_file_new.gz
tibetish
  • Is there a reason that this question mentions gzip at all? You'd have the same problem if neither your inputs nor your outputs were gzipped, and you could adopt any solution that worked without gzip as part of the problem by putting a `zcat` on the front and a `gzip` on the back. – Charles Duffy Jul 20 '20 at 14:12
  • @CharlesDuffy There is no direct need, but it is useful to know that the `gz` file might be big and the user wants to avoid unzipping it twice. – kvantour Jul 20 '20 at 15:21
  • @kvantour, ...perhaps I'm missing something: It seems to me that a generic stream solution wouldn't _ever_ need to read the input twice. Even if it's just concatenating the first line with filter result, `gunzip -c bar` would be trivially adapted to `(head -n 1 < <(gunzip -c bar.gz`. – Charles Duffy Jul 20 '20 at 15:30
  • It's a >19 million lines compressed data set. – tibetish Jul 20 '20 at 15:39
  • @CharlesDuffy you are not missing anything. – kvantour Jul 20 '20 at 15:45
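
The single-pass stream idea from the comments above can be sketched as follows. This is only an illustration under stated assumptions, not the commenter's exact command: it uses the shell's `read` builtin instead of `head -n 1`, because `head` may consume more than one line from a pipe, and it uses `gzip -dc` as a portable stand-in for `zcat`. The scratch directory and sample rows are made up for the demo. Note that `grep -Fw` matches the listed values anywhere in a line, not only in the first column.

```shell
# Scratch directory and sample data (illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'some header0   some header1' \
  '10:100000625   a' \
  '10:100002464   c' \
  '10:100004360   f' | gzip > "$tmpdir/my_file.gz"
printf '%s\n' '10:100002464' '10:100004360' > "$tmpdir/my_other_file.txt"

# Decompress once; the brace group shares one stdin between read and grep.
gzip -dc "$tmpdir/my_file.gz" | {
  IFS= read -r header                        # consume only the first line
  printf '%s\n' "$header"                    # pass the header through unchanged
  grep -Fw -f "$tmpdir/my_other_file.txt"    # filter the remaining lines
} | gzip > "$tmpdir/my_file_new.gz"

# Inspect the result.
result=$(gzip -dc "$tmpdir/my_file_new.gz")
printf '%s\n' "$result"
```

This keeps the original `grep` filter and still decompresses the input only once.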

1 Answer


Replace grep -Fw -f my_other_file.txt with:

awk 'NR==FNR{a[$1]; next} (FNR==1) || ($1 in a)' my_other_file.txt -

e.g. using cat my_file.txt on a flat file in place of zcat my_file.gz on a gzipped one:

$ cat my_file.txt | awk 'NR==FNR{a[$1]; next} (FNR==1) || ($1 in a)' my_other_file.txt -
some header0   some header1
10:100002464   c
10:100004360   f
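
For the gzipped case, the whole pipeline might look like the sketch below. The file names come from the question; the scratch directory and the sample rows are assumptions for the demo, and `gzip -dc` is used as a portable equivalent of `zcat`. The trailing `-` tells awk to read the decompressed stream from standard input as its second input.

```shell
# Scratch directory and sample data (illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'some header0   some header1' \
  '10:100000625   a' \
  '10:100002464   c' \
  '10:100004360   f' | gzip > "$tmpdir/my_file.gz"
printf '%s\n' '10:100002464' '10:100004360' > "$tmpdir/my_other_file.txt"

# Decompress once, filter with awk, recompress.
# NR==FNR is true only while awk reads the first file (the list of keys);
# FNR==1 then keeps the header line of the stream read from stdin ("-").
gzip -dc "$tmpdir/my_file.gz" |
  awk 'NR==FNR{a[$1]; next} (FNR==1) || ($1 in a)' "$tmpdir/my_other_file.txt" - |
  gzip > "$tmpdir/my_file_new.gz"

# Inspect the result.
result=$(gzip -dc "$tmpdir/my_file_new.gz")
printf '%s\n' "$result"
```

Unlike `grep -Fw`, this compares the first column exactly, so a listed value that happens to appear in another column does not match.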

If my_other_file.txt can contain DOS line endings (see "Why does my tool output overwrite itself and how do I fix it?", https://stackoverflow.com/q/45772525/1745001) then use:

awk 'NR==FNR{sub(/\r/,""); a[$1]; next} (FNR==1) || ($1 in a)' my_other_file.txt -
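
A small way to check the DOS-endings variant is sketched below, assuming a list file deliberately written with CRLF line endings. The scratch directory and sample rows are made up for the demo. Because `sub()` modifies `$0`, awk re-splits the record, so `$1` no longer carries the carriage return when it is stored as a key.

```shell
# Scratch directory and sample data (illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'some header0   some header1' \
  '10:100000625   a' \
  '10:100002464   c' \
  '10:100004360   f' > "$tmpdir/my_file.txt"

# List file written with DOS (CRLF) line endings on purpose.
printf '%s\r\n' '10:100002464' '10:100004360' > "$tmpdir/my_other_file.txt"

# sub(/\r/,"") strips the carriage return before the key is stored.
fixed=$(cat "$tmpdir/my_file.txt" |
  awk 'NR==FNR{sub(/\r/,""); a[$1]; next} (FNR==1) || ($1 in a)' "$tmpdir/my_other_file.txt" -)
printf '%s\n' "$fixed"
```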
Ed Morton
  • Thank you for your help. I tried your example with simple txt files and it worked perfectly, but when I try the same with the compressed file I get an empty output. I'm probably misusing the output command, not sure. – tibetish Jul 20 '20 at 15:13
  • @tibetish you want to do: `zcat my_file.gz | awk 'NR==FNR{a[$1]; next} (FNR==1) || ($1 in a)' my_other_file.txt -`. Be aware that the hyphen is important, as it references `/dev/stdin` (i.e. the output of `zcat`). – kvantour Jul 20 '20 at 15:20
  • @tibetish please post **exactly** the command line you're running up to just before the `| gzip > my_file_new.gz` that's producing no output. – Ed Morton Jul 20 '20 at 15:26
  • zcat my_file.gz | awk 'NR==FNR{a[$1]; next} (FNR==1) || ($1 in a)' my_other_file.txt – tibetish Jul 20 '20 at 15:37
  • That last character looks too long to be a plain text `-`, it looks like an em-dash or something. Check that and make sure you just use a plain old minus sign. Also try changing `{a[$1]` to `{sub(/\r/,""); a[$1]` just in case you have DOS line endings in my_other_file.txt (see https://stackoverflow.com/q/45772525/1745001). – Ed Morton Jul 20 '20 at 15:42