
I have a list of names and IDs (50 entries)

cat input.txt

name    ID
Mike    2000
Mike    20003
Mike    20002

And there is a huge zipped file (13GB)

zcat clients.gz

name    ID  comment
Mike    2000    foo
Mike    20002   bar
Josh    2000    cake
Josh    20002   _

My expected output is

NR  name    ID  comment
1    Mike   2000    foo
3    Mike   20002   bar

each $1"\t"$2 of clients.gz is a unique identifier. There might be some entries from input.txt that might be missing from clients.gz. Thus, I would like to add the NR column to my output to find out which are missing. I would like to use zgrep. awk takes a very long time (since I had to zcat for uncompress the zipped file I assume?)

I know that `zgrep 'Mike\t2000'` does not work. The NR issue I imagine I can fix with awk's FNR.
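(For reference, the literal pattern fails because grep, and hence zgrep, does not interpret `\t` as a tab; the tab has to reach the pattern some other way. A minimal sketch, assuming bash and GNU grep, since zgrep passes its options through to grep:

# bash's ANSI-C quoting ($'...') turns \t into a real tab before zgrep sees it
zgrep -w $'Mike\t2000' clients.gz

# alternatively, GNU grep's -P (PCRE) understands \t itself
zgrep -P -w 'Mike\t2000' clients.gz
)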

So far I have:

awk -v q="'" '
NR > 1 {
  print "zcat clients.gz | zgrep -w $" q$0q
}' input.txt |
bash > subset.txt
Shahin
  • I've been thinking about this for a while. I mixed zgrep with awk and am running an array, though I'm not sure if there is an easier way? I could not figure out how to do `zgrep '$Mike\t2002'` – Shahin Jan 26 '20 at 08:30
  • Wouldn't it make more sense to print out the entries that aren't present in the clients file if that's what you're interested in? – Shawn Jan 27 '20 at 03:22
  • @Shawn The comment column is what I am interested in and there are a few more things I simplified. – Shahin Jan 27 '20 at 03:34

3 Answers


With GNU awk and bash:

awk 'BEGIN{FS=OFS="\t"} 
     # process input.txt
     NR==FNR{
       a[$1,$2]=$1 FS $2
       line[$1,$2]=NR-1
       next
     }
     # process <(zcat clients.gz)
     {
       $4=a[$1,$2]
       if(FNR==1)
         line[$1,$2]="NR"
       if($4!="")
         print line[$1,$2],$1,$2,$3
     }' input.txt <(zcat clients.gz)

Output:

NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar

As one line:

awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1,$2]=$1 FS $2; line[$1,$2]=NR-1; next} {$4=a[$1,$2]; if(FNR==1) line[$1,$2]="NR"; if($4!="")print line[$1,$2],$1,$2,$3}' input.txt <(zcat clients.gz)

See: Joining two files based on two key columns awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Cyrus
  • would this be faster than zgrep? – Shahin Jan 26 '20 at 09:07
  • I can't answer that question. – Cyrus Jan 26 '20 at 09:09
  • Thanks! This looks very clean with NR-1. I assume it reads the input file, which is small, into memory (looping through the first section), then scans the second file (zcat clients.gz) – Shahin Jan 26 '20 at 09:30
  • Yes, `NR==FNR{...; next}` loops only through the first file and reads columns 1 and 2 into array `a` and the current line number into array `line`. The second section then loops only through the second "file" and checks whether there is a match in array `a`. – Cyrus Jan 26 '20 at 09:39
  • 1
    Per [this item](https://www.gnu.org/software/gawk/manual/html_node/Reference-to-Elements.html#Reference-to-Elements) ... isn't `$4=a[$1,$2]` going to create an array element for each row produced by `zcat clients.gz`? wouldn't `if ( ($1 FS $2) in a)` require less memory? – markp-fuso Jan 26 '20 at 14:12
  • @markp yes, as written `a[]` would end up with an entry for every unique $1,$2 combination in the file that's 13GB **when compressed**. – Ed Morton Jan 26 '20 at 17:14
  • 1
    @EdMorton thanks, I wanted to make sure I wasn't missing something esoteric between a) array reference as a test (the exchange you and I had yesterday) and b) array reference in an assignment – markp-fuso Jan 26 '20 at 17:26
  • By assigning to a field ($4) it will also force awk to recompile every line that's read so using a variable instead of $4 would also be more efficient. – Ed Morton Jan 26 '20 at 17:41
  • This took ~30 mins with a max memory of 62GB – Shahin Jan 27 '20 at 02:17
  • @user171558 did you test the other solutions posted? How long and how much memory did they take? – Ed Morton Jan 27 '20 at 14:26
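Putting markp-fuso's and Ed Morton's comments together, a lower-memory variant of this answer might look like the sketch below (untested here): it tests key membership instead of assigning into `a[]`, leaves `$4` alone, and ends up close to Ed Morton's answer further down.

awk 'BEGIN{FS=OFS="\t"}
     # input.txt: remember each (name, ID) key and its line number ("NR" for the header)
     NR==FNR{line[$1,$2]=(NR>1 ? NR-1 : "NR"); next}
     # <(zcat clients.gz): only print rows whose key was seen in input.txt
     ($1,$2) in line{print line[$1,$2], $1, $2, $3}' input.txt <(zcat clients.gz)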

[EDIT]
I had misunderstood where the prepended line numbers come from. Corrected.

Would you try the following:

declare -A num          # associates each pattern with its line number
mapfile -t ary < <(tail -n +2 input.txt)
pat=$(IFS='|'; echo "${ary[*]}")
for ((i=0; i<${#ary[@]}; i++)); do num[${ary[i]}]=$((i+1)); done
printf "%s\t%s\t%s\t%s\n" "NR" "name" "ID" "comment"
zgrep -E -w "$pat" clients.gz | while IFS= read -r line; do
    printf "%d\t%s\n" "${num[$(cut -f 1-2 <<<"$line")]}" "$line"
done

Output:

NR  name    ID  comment
1   Mike    2000    foo
3   Mike    20002   bar
  • The second and third lines generate a search pattern like Mike 2000|Mike 20003|Mike 20002 from input.txt.
  • The line `for ((i=0; i<${#ary[@]}; i++)); do ..` creates a map from each pattern to its line number.
  • The expression "${num[$(cut -f 1-2 <<<"$line")]}" retrieves the line number from the 1st and 2nd fields of the output.

If the performance is still not satisfactory, please consider ripgrep, which is much faster than grep or zgrep.
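For example, a minimal sketch assuming ripgrep is installed: `-z`/`--search-zip` lets it read the gzip directly, and `-w` keeps whole-word matches so 2000 does not also hit 20003. Only the search stage changes; the numbering loop above stays the same.

# drop-in replacement for the `zgrep -E -w "$pat" clients.gz` stage above
rg -z -w "$pat" clients.gz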

tshiono
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ key = $1 FS $2 }
NR == FNR { map[key] = (NR>1 ? NR-1 : "NR"); next }
key in map { print map[key], $0 }

$ zcat clients.gz | awk -f tst.awk input.txt -
NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar
Ed Morton
  • Thanks a lot for your help. It returns an empty output on the test files. I'm trying to understand where it goes wrong – Shahin Jan 27 '20 at 16:22
  • The usual suspect is DOS line endings. Try running dos2unix or similar on input.txt then run the command again. See https://stackoverflow.com/q/45772525/1745001 for more info on that issue. If it's not that that then make sure your fields in both files truly are tab-separated as you said they are. – Ed Morton Jan 27 '20 at 16:26
  • I was dropping the hyphen; I've fixed it. What does it do? – Shahin Jan 27 '20 at 16:43
  • The script? Creates a mapping of the 2 key fields to the line number from input.txt and then for every key field from clients.gz that exist in that map (i.e. were present in input.txt) prints the associated line number followed by the current line. Let me know if you have any other questions. – Ed Morton Jan 27 '20 at 16:55
  • `zcat clients.gz | awk -f tst.awk input.txt -` The hyphen after input.txt – Shahin Jan 27 '20 at 16:55
  • 1
    Oh, in just about any UNIX tool `-` in place of a file name means "stdin" so that's what's telling awk to read the input comping from the `zcat ... |` pipeline after it reads input.txt. It's similar to `<(zcat clients.gz)` in Cyrus' script but not bash-specific. – Ed Morton Jan 27 '20 at 16:57
  • 1
    took 20 mins and 2.5M memory. One last issue, how do I add the column names? – Shahin Jan 27 '20 at 17:41
  • 1
    They should be printed by default (see the output in my answer). If they aren't then the first line of each file doesn't look like what you showed in the example in your question. – Ed Morton Jan 27 '20 at 18:01
  • something else came to my mind! `zcat file.gz | awk -f tst.awk input.txt` prints file.gz line by line and compares it with input.txt? Using <(zcat file.gz) <(cat input.txt) is memory intensive because it decompresses and prints file.gz and input to memory? – Shahin Jan 28 '20 at 02:28
  • Not into memory but into temp files on your disk. – Ed Morton Jan 30 '20 at 04:36