So I have the following sed one liner:

sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' -e 's/H|/,H|/g' -e 's/Q|/,,Q|/g' -e '1 i\,,,' sample_1.txt > sample_2.txt

I have many lines that start with either:

  • S|
  • T|
  • #D=
  • ##
  • H|
  • Q|

The idea is not to copy the lines starting with one of the first four, and to replace H| at the beginning of a line with ,H| and Q| at the beginning of a line with ,,Q|
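For example, with some made-up lines (illustrative data only), the one-liner turns this input:

H|alpha
S|drop me
Q|beta
##comment
X|other

into:

,,,
,H|alpha
,,Q|beta
X|other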

But now I would need to:

  • use the fastest way possible (the internet suggests (m)awk is faster than sed)
  • read from a .txt.gz file and save the result to a .txt.gz file, avoiding the intermediate unzip/re-zip if possible

There are in fact several hundred .txt.gz files to process this way, each about ~1GB, all in the same folder. Is there a CLI way to run the code in parallel on all of them, so that each core gets assigned a subset of the files in the directory?

I use Linux (Ubuntu).

  • *"I have many lines that start with either: ... `^#D=`"* - is that a typo? Are those the only line beginnings or can there be others? Are the files ~1GB compressed or uncompressed? If the former, how big uncompressed? – haukex Jun 18 '18 at 19:39
  • @haukex: `^#D=` was a typo (fixed now, thanks!). Compressed files are ~1GB each, uncompressed about 20 times more. – user189035 Jun 18 '18 at 20:22

2 Answers

Untested, but likely pretty close to this, using GNU Parallel.

First, make an output directory so as not to overwrite any valuable data:

mkdir -p output

Now declare a function that processes one file, and export it so the jobs started by GNU Parallel can find it:

doit(){
    echo "Processing $1"
    gzcat "$1" | awk '
        /^[ST]\|/ || /^#D=/ || /^##/ {next}    # drop lines starting S|, T|, #D= or ##
        /^H\|/ {print "," $0; next}            # prefix "H|" lines with ","
        /^Q\|/ {print ",," $0; next}           # prefix "Q|" lines with ",,"
        1                                      # print all other lines unchanged
    ' | gzip > output/"$1"
}
export -f doit

Now process all the .txt.gz files in parallel, showing a progress bar too:

parallel --bar doit ::: *.txt.gz
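By default GNU Parallel runs one job per core. If that saturates your disk I/O before your CPUs, the `-j` option can throttle the number of simultaneous jobs, e.g.:

parallel -j 50% --bar doit ::: *.txt.gz    # use half the cores; adjust to taste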
  • Thanks, this works really great. For future users: I had to replace `gzcat` with `gunzip -c` on Ubuntu... – user189035 Jun 19 '18 at 13:15
  • Small comment: to get the desired output, I replaced `{print ","}` with `{print ","$0}` and `{print ",,"}` with `{print ",,"$0}` (probably my question was not clear on that point) – user189035 Jun 19 '18 at 14:25
  • Quick question: do you think this could have been done faster using `grep`? (I'm still learning these stream tricks) – user189035 Jun 19 '18 at 15:20
  • No, I don't think so. `awk` is extremely fast and capable. I suspect the bulk of the time is spent recompressing the output. In fact, try it: replace the `| gzip ...` with `> /dev/null` and see how long the `awk` takes without recompressing. – Mark Setchell Jun 19 '18 at 15:23
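For reference, the timing test suggested in the last comment might look like this (a sketch; `sample.txt.gz` is a placeholder, and `filter.awk` would hold the awk program from the answer saved to a file):

# Compare the filter alone against filter-plus-recompression
time (zcat sample.txt.gz | awk -f filter.awk > /dev/null)           # filter only
time (zcat sample.txt.gz | awk -f filter.awk | gzip > /dev/null)    # filter + gzip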

Was something like this what you had in mind?

#!/bin/bash

export LC_ALL=C

zcat sample_1.txt.gz | gawk '
$1 !~ /^([ST]\||#D=|##)/ {        # skip lines starting S|, T|, #D= or ##
    switch ($0) {
        case /^H\|/:              # prefix H| lines with one comma
            print "," $0
            break
        case /^Q\|/:              # prefix Q| lines with two commas
            print ",," $0
            break
        default:                  # pass everything else through unchanged
            print $0
    }
}' | gzip > sample_2.txt.gz

The `export LC_ALL=C` tells your environment you aren't expecting extended characters, which can profoundly speed up execution. `zcat` decompresses a .gz file and dumps it to stdout. That is piped into `gawk`, which checks that the first part of each line does not match any of the first four character groupings in your question. Lines that pass that test are printed to stdout, massaged as requested. As `gawk` executes, its stdout is piped into `gzip` and written to a .txt.gz file.
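One caveat: `switch` is a gawk extension, and the question mentions (m)awk as the speed option. A plain pattern-action variant (an untested sketch, same file names as above) should run under mawk or any POSIX awk:

zcat sample_1.txt.gz | mawk '
/^([ST]\||#D=|##)/ { next }       # drop unwanted record types
/^H\|/ { print "," $0; next }     # one comma before H| lines
/^Q\|/ { print ",," $0; next }    # two commas before Q| lines
{ print }                         # everything else unchanged
' | gzip > sample_2.txt.gz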

It might be possible to use `xargs` with the `-P` and `-n` switches to parallelize your processing, but I think GNU Parallel might be easier to work with.
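For example, an `xargs` version might look like this (a sketch, untested; `filter.awk` is a placeholder for the awk program above saved to a file, and it assumes the output directory already exists):

# One job per file, up to $(nproc) at a time; filter.awk and output/ are assumed
find . -maxdepth 1 -name '*.txt.gz' -print0 |
    xargs -0 -P "$(nproc)" -I{} \
        bash -c 'zcat "$1" | gawk -f filter.awk | gzip > output/"$(basename "$1")"' _ {}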
