2

I have a tab delimited text file (animals.txt) with five columns:

302947298 2340974238 0 0 cat
345098948 8345988989 0 0 dog
098982388 2098340923 0 0 fish
932840923 0923840988 0 0 parrot

I have another file, mess.txt.gz, which is compressed with GNU zip (.gz file). Decompressed, it is essentially one massive string of letters:

sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf

For every line in the tab-delimited text file, I want to check whether that line's animal name occurs anywhere in this .gz file.

Ideally, it would return something like this:

302947298 2340974238 0 0 cat no
345098948 8345988989 0 0 dog yes
098982388 2098340923 0 0 fish no
932840923 0923840988 0 0 parrot yes

At the moment I am doing the following:

gunzip -cd mess.txt.gz | grep cat
gunzip -cd mess.txt.gz | grep dog

To automate it, I've tried the following:

cat animals.txt | awk '{print $5}' > animal_names.txt

cat animal_names.txt | while read line 
do
   gunzip -cd mess.txt.gz | grep $line > output.txt
done

I've also tried:

cat animal_names.txt | while read line 
do
   if [ gunzip -cd mess.txt.gz | grep $line ]
   then
     echo "Yes"
   else
     echo "No"
   fi
   ; do
done > output.txt

What is the best way to do this in bash?

anubhava
icedcoffee
  • As an aside, `if [ gunzip -cd mess.txt.gz | grep $line ]` is a syntax error; even if it wasn't, it just checks whether `gunzip` etc is not an empty string, which of course it isn't. Perhaps see also https://stackoverflow.com/questions/36371221/checking-the-success-of-a-command-in-a-bash-if-statement – tripleee Jun 04 '21 at 08:46
  • Perhaps see also why you may want to avoid a [useless use of `cat`](https://stackoverflow.com/questions/11710552/useless-use-of-cat) – tripleee Jun 04 '21 at 08:48
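
To illustrate the point made in the comments, here is a minimal sketch of the loop with the `if` testing the pipeline's exit status directly, with no `[ ]` around it. The function name `check_each` is hypothetical, and the sketch assumes the animal name is the last tab-separated field and that `pipefail` is not set. It decompresses the archive once per row, so it is meant to show the syntax rather than an efficient solution.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append yes/no to each row of a tab-delimited table,
# depending on whether the row's last field occurs in a gzip-compressed file.
check_each() {
  table=$1
  archive=$2
  tab=$(printf '\t')
  while IFS= read -r line; do
    name=${line##*"$tab"}          # keep only the text after the last tab
    # `if` branches on the exit status of the pipeline (that of grep);
    # -q suppresses output, -F treats the name as a literal string.
    if gzip -dc "$archive" | grep -qF -- "$name"; then
      printf '%s\t%s\n' "$line" yes
    else
      printf '%s\t%s\n' "$line" no
    fi
  done < "$table"
}
```

It would be called as `check_each animals.txt mess.txt.gz > output.txt`, with the redirection placed once after the call rather than inside the loop.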

4 Answers

5

You can pass all the search strings to zgrep -Ff - in one pass:

cut -f5 animals.txt |
zgrep -Ff - mess.txt.gz

The -F option says to look for literal strings, not regular expressions (avoids false positives if the input contains dots or other regex metacharacters, and besides, will be significantly faster) and -f - says to read the search patterns from standard input (i.e. from the pipe from cut).

If you want a list of the matched animals, add an -o option and a brief postprocessing step:

cut -f5 animals.txt |
zgrep -Ff - -o mess.txt.gz |
sort | uniq -c

If you don't care how many times each one occurred, you can replace sort | uniq -c with just sort -u.

This works as intended on Linux with GNU grep, but macOS (and thus probably generally *BSD) grep -o only prints the first match in each input line when combined with -f -. If you need *BSD portability, I'd go with either of the other solutions here (currently there's one for sed and one for Awk).
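
To go from the list of matched names to the yes/no table asked for in the question, one possible sketch feeds the zgrep output into a second pass over animals.txt. The function name `annotate_matches` is hypothetical, and the sketch assumes GNU grep (so it inherits the -o caveat above) and that the animal name is the fifth and last tab-separated column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append yes/no to each row of the table, based on the
# set of names that zgrep -o reports as present in the compressed file.
annotate_matches() {
  table=$1
  archive=$2
  cut -f5 "$table" |
    zgrep -Ff - -o "$archive" |
    sort -u |
    awk 'BEGIN { FS = OFS = "\t" }
         # First input (the table): remember each row and its last field.
         NR == FNR { row[++n] = $0; name[n] = $NF; next }
         # Second input (stdin): every line is a name that was matched.
         { seen[$0] }
         END {
           for (i = 1; i <= n; i++)
             print row[i], ((name[i] in seen) ? "yes" : "no")
         }' "$table" -
}
```

It would be called as `annotate_matches animals.txt mess.txt.gz`. One limitation to keep in mind: when one name is a prefix of another (say cat and catfish), grep -o reports only the longest match at a given position, so the shorter name can be missed.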

tripleee
  • This would show which animals are in mess.txt, but it is missing the _yes/no_ information the OP wants. Of course this is a mere cosmetic issue; such a table could easily be generated afterwards from the output of zgrep. – user1934428 Jun 04 '21 at 08:35
  • @user1934428 Thanks; added a brief postprocessing step which doesn't do exactly that, but at least should offer some direction. – tripleee Jun 04 '21 at 08:38
1

You may use this awk solution with gzcat (on most Linux systems the equivalent command is zcat or gzip -dc):

awk 'BEGIN{FS=OFS="\t"} FNR==NR {s=s $0; next} {print $0, (index(s, $NF) > 0 ? "yes" : "no")}' <(gzcat mess.txt.gz) animals.txt

302947298  2340974238  0  0  cat     no
345098948  8345988989  0  0  dog     yes
098982388  2098340923  0  0  fish    no
932840923  0923840988  0  0  parrot  yes

A more readable form:

awk '
BEGIN {FS=OFS="\t"}
FNR == NR {
   s = s $0
   next
}
{
   print $0, (index(s, $NF) > 0 ? "yes" : "no")
}
' <(gzcat mess.txt.gz) animals.txt
anubhava
1

What about this?

gunzip -cd mess.txt.gz | grep "$(< animals.txt sed -e 's/.*\t//' | sed -z 's/\n/\\|/g;s/\\|$//')"

It is basically the version of your

gunzip -cd mess.txt.gz | grep dog

where, instead of dog, the regex dog\|cat\|whatever is generated from the file animals.txt.

My command should give you the output that your first loop (the one after "To automate it, I've tried the following:") was aiming for. It does not produce the yes/no table you describe as ideal.
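
As a sketch of the same idea without GNU-only sed features, the alternation can also be built with cut and paste and handed to grep -E. The function name `build_alternation` is hypothetical, and the sketch assumes the names sit in the fifth tab-separated column and contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join the fifth column of a tab-delimited file into a
# single extended-regex alternation such as cat|dog|fish|parrot.
build_alternation() {
  # cut splits on tabs by default; paste -s joins all lines with the delimiter
  cut -f5 "$1" | paste -sd '|' -
}
```

The result would then be used as `gunzip -cd mess.txt.gz | grep -E "$(build_alternation animals.txt)"`, where -E makes a bare | mean alternation.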

Enlico
  • `sed -z` is a GNU extension, and not portable to other platforms. You could probably refactor the second `sed` script to work portably; or just replace it with `tr '\n' '|' | sed 's/|$//;s/|/\\&/g'`. The whitespace in the first `sed` script should be a literal tab to work properly with tab-delimited input. – tripleee Jun 04 '21 at 08:52
  • Thanks for pointing out the tab-related bug. As regards the rest, you're surely right. – Enlico Jun 04 '21 at 08:54
1

Many nice answers here, and a very good one from @tripleee.

Just adding the 'in memory' bash way:

#!/bin/bash
search() {
  local patterns="$2"
  local string="$(gunzip -cd "$1")"
  while IFS= read -r line; do
    local pattern="${line/[^$'\t']*$'\t'/}"
    local suffix="no"
    [ "${string/${pattern}/}" != "${string}" ] && suffix="yes"
    echo "${line} ${suffix}"
  done < "${patterns}"
}
search mess.txt.gz animals.txt

The goal here is to limit I/O: read the gzipped mess.txt.gz once, read animals.txt once, and match the names in memory as string patterns.

Zilog80
  • You want `read -r` to not mangle backslashes in the input, and probably `IFS=` to preserve whitespace. But keeping the entire `gzip` file in memory seems deeply flawed, and doing this in pure shell is borderline madness anyway. – tripleee Jun 07 '21 at 05:34
  • @tripleee You're right, thanks, edited. That's mainly to show that it's possible, which does not imply that it's desirable ^^ – Zilog80 Jun 07 '21 at 08:09