For every line in a file, determine if string is present within another file

Question

I have a tab delimited text file (animals.txt) with five columns:

302947298 2340974238 0 0 cat
345098948 8345988989 0 0 dog
098982388 2098340923 0 0 fish
932840923 0923840988 0 0 parrot

I have another file, mess.txt.gz, which is compressed using GNU zip (.gz file). It basically looks like a massive string of letters:

sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf

For every line in the tab delimited text file, I want to check whether that line's animal name is present within this .gz file.

Ideally, it would return something like this:

302947298 2340974238 0 0 cat no
345098948 8345988989 0 0 dog yes
098982388 2098340923 0 0 fish no
932840923 0923840988 0 0 parrot yes

At the moment I am doing the following:

gunzip -cd mess.txt.gz | grep cat
gunzip -cd mess.txt.gz | grep dog

To automate it, I've tried the following:

cat animals.txt | awk '{print $5}' > animal_names.txt

cat animal_names.txt | while read line 
do
   gunzip -cd mess.txt.gz | grep $line > output.txt
done

I've also tried:

cat animal_names.txt | while read line 
do
   if [ gunzip -cd mess.txt.gz | grep $line ]
   then
     echo "Yes"
   else
     echo "No"
   fi
   ; do
done > output.txt

What is the best way to do this in bash?

As an aside, `if [ gunzip -cd mess.txt.gz | grep $line ]` is a syntax error; even if it wasn't, it just checks whether `gunzip` etc is not an empty string, which of course it isn't. Perhaps see also https://stackoverflow.com/questions/36371221/checking-the-success-of-a-command-in-a-bash-if-statement — tripleee, Jun 04 '21 at 08:46
Perhaps see also why you may want to avoid a [useless use of `cat`](https://stackoverflow.com/questions/11710552/useless-use-of-cat) — tripleee, Jun 04 '21 at 08:48
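As a hedged illustration of the point made in the comments above: `if` can test the exit status of the pipeline directly, so no `[ ]` is needed. The sketch below assumes the sample data from the question, creates its own scratch copies of both files so that it runs as written, and uses a hypothetical helper name, `annotate`.

```shell
#!/bin/bash
# Sketch only: the sample files are written to a scratch directory so the
# example is self-contained; substitute your real animals.txt and mess.txt.gz.
workdir=$(mktemp -d)
printf '%s\t%s\t0\t0\t%s\n' \
  302947298 2340974238 cat \
  345098948 8345988989 dog \
  098982388 2098340923 fish \
  932840923 0923840988 parrot > "$workdir/animals.txt"
printf 'sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf\n' |
  gzip -c > "$workdir/mess.txt.gz"

# annotate is a hypothetical helper name, not part of any existing tool.
annotate() {
  local animals=$1 archive=$2 line name
  while IFS= read -r line; do
    name=${line##*$'\t'}   # last tab-separated field of the line
    # `if` checks the exit status of the pipeline, which is grep's status.
    # -q suppresses output; -F treats the name as a literal string.
    if gunzip -c "$archive" | grep -qF -- "$name"; then
      printf '%s\tyes\n' "$line"
    else
      printf '%s\tno\n' "$line"
    fi
  done < "$animals"
}

annotate "$workdir/animals.txt" "$workdir/mess.txt.gz"
```

Note that this decompresses the archive once per animal, so it is probably only reasonable for short lists or small archives.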

tripleee · Accepted Answer · 2021-06-04T08:44:56.153

You can pass all the search strings to zgrep -Ff - in one pass:

cut -f5 animals.txt |
zgrep -Ff - mess.txt.gz

The -F option says to look for literal strings, not regular expressions (avoids false positives if the input contains dots or other regex metacharacters, and besides, will be significantly faster) and -f - says to read the search patterns from standard input (i.e. from the pipe from cut).

If you want a list of the matched animals, add an -o option and a brief postprocessing step:

cut -f5 animals.txt |
zgrep -Ff - -o mess.txt.gz |
sort | uniq -c

You can replace sort | uniq -c with just sort -u if you don't care how many there were of each.

This works as intended on Linux with GNU grep, but macOS (and thus probably generally *BSD) grep -o only prints the first match in each input line when combined with -f -. If you need *BSD portability, I'd go with either of the other solutions here (currently there's one for sed and one for Awk).
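One possible way to turn the list of matched names into the yes/no table from the question is sketched below. It is not part of the original answer: the helper name `yes_no_table` and the temporary file are assumptions made for illustration, the sample files are recreated in a scratch directory so the block runs on its own, and it relies on the GNU grep behaviour of -o described above.

```shell
#!/bin/bash
# Sketch only: recreate the question's sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\t%s\t0\t0\t%s\n' \
  302947298 2340974238 cat \
  345098948 8345988989 dog \
  098982388 2098340923 fish \
  932840923 0923840988 parrot > "$workdir/animals.txt"
printf 'sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf\n' |
  gzip -c > "$workdir/mess.txt.gz"

# yes_no_table is a hypothetical helper name.
yes_no_table() {
  local animals=$1 archive=$2 found
  found=$(mktemp)
  # One pass over the archive: collect each name that occurs at least once.
  cut -f5 "$animals" |
  zgrep -Ff - -o "$archive" |
  sort -u > "$found"
  # First file: remember the names that were found.
  # Second file: append yes/no depending on whether column 5 was found.
  awk -F'\t' -v OFS='\t' '
    FILENAME == ARGV[1] { seen[$0]; next }
    { print $0, (($5 in seen) ? "yes" : "no") }
  ' "$found" "$animals"
  rm -f "$found"
}

yes_no_table "$workdir/animals.txt" "$workdir/mess.txt.gz"
```

Testing `FILENAME == ARGV[1]` rather than the more common `NR == FNR` keeps the script correct when nothing matched and the first file is empty.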

This would show, which animals are in mess.txt, but is missing the _yes/no_ information the OP wants to know. Of course this is a mere cosmetic issue; such a table could be generated easily afterwards, based on the output of zgrep. — user1934428, Jun 04 '21 at 08:35
@user1934428 Thanks; added a brief postprocessing step which doesn't do exactly that, but at least should offer some direction. — tripleee, Jun 04 '21 at 08:38

score 1 · Answer 2 · answered Jun 04 '21 at 08:27

You may use this awk solution with gzcat:

awk 'BEGIN{FS=OFS="\t"} FNR==NR {s=s $0; next} {print $0, (index(s, $NF) > 0 ? "yes" : "no")}' <(gzcat mess.txt.gz) animals.txt

302947298  2340974238  0  0  cat     no
345098948  8345988989  0  0  dog     yes
098982388  2098340923  0  0  fish    no
932840923  0923840988  0  0  parrot  yes

A more readable form:

awk '
BEGIN {FS=OFS="\t"}
FNR == NR {
   s = s $0
   next
}
{
   # index() returns 0 when there is no match and 1 for a match at the very start
   print $0, (index(s, $NF) > 0 ? "yes" : "no")
}
' <(gzcat mess.txt.gz) animals.txt

answered Jun 04 '21 at 08:27

anubhava

I like this solution but when I tried it on my files it returned "no" for every option – icedcoffee Jun 04 '21 at 08:38
I tested it on your provided sample data and got this output shown in my answer – anubhava Jun 04 '21 at 08:40
Apologies, I was using zcat instead of gzcat. – icedcoffee Jun 04 '21 at 08:43
On some systems zcat works fine with gz files but I am on OSX so had to use gzcat – anubhava Jun 04 '21 at 08:50

Enlico · Answer 3 · 2021-06-04T08:53:38.157

1

What about this?

gunzip -cd mess.txt.gz | grep "$(< animals.txt sed -e 's/.*\t//' | sed -z 's/\n/\\|/g;s/\\|$//')"

It is basically the version of your

gunzip -cd mess.txt.gz | grep dog

where, instead of dog, the regex dog\|cat\|whatever is generated from the file animals.txt.

My command should give you the same output as the loop you show after

To automate it, I've tried the following:

which is not the result you describe as ideal.

edited Jun 04 '21 at 08:53

answered Jun 04 '21 at 08:27

`sed -z` is a GNU extension, and not portable to other platforms. You could probably refactor the second `sed` script to work portably; or just replace it with `tr '\n' '|' | sed 's/|$//;s/|/\\&/g'`. The whitespace in the first `sed` script should be a literal tab to work properly with tab-delimited input. – tripleee Jun 04 '21 at 08:52
Thanks for pointing out the tab-related bug. As regards the rest, you're surely right. – Enlico Jun 04 '21 at 08:54
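For illustration, the `tr` and `sed` variant suggested in the comment above could be wrapped as follows. This is only a sketch under stated assumptions: `build_pattern` is a hypothetical helper name, the sample files are recreated in a scratch directory, and the `\|` alternation it produces is itself a GNU extension to basic regular expressions, so the final `grep` is not guaranteed to behave the same on every platform. Names containing `|` or regex metacharacters would also need escaping first.

```shell
#!/bin/bash
# Sketch only: recreate the question's sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\t%s\t0\t0\t%s\n' \
  302947298 2340974238 cat \
  345098948 8345988989 dog \
  098982388 2098340923 fish \
  932840923 0923840988 parrot > "$workdir/animals.txt"
printf 'sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf\n' |
  gzip -c > "$workdir/mess.txt.gz"

# build_pattern is a hypothetical helper name.
# It joins column 5 into one alternation: name1\|name2\|...
build_pattern() {
  cut -f5 "$1" |
  tr '\n' '|' |                 # newline-separated names become a|b|c|
  sed 's/|$//;s/|/\\&/g'        # drop the trailing |, then turn each | into \|
}

pattern=$(build_pattern "$workdir/animals.txt")
printf 'pattern: %s\n' "$pattern"

# Print each match on its own line (-o), one pass over the archive.
gunzip -c "$workdir/mess.txt.gz" | grep -o "$pattern"
```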

Zilog80 · Answer 4 · 2021-06-07T08:17:47.323

1

Many nice answers here, and a very good one from @tripleee.

Just adding the 'in memory' Bash way:

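#!/bin/bash
# Usage: search <gzipped file> <tab-delimited file>
search() {
  local patterns="$2"
  local string
  string="$(gunzip -cd "$1")"   # decompress once and keep the text in memory
  while IFS= read -r line; do
    local pattern="${line##*$'\t'}"   # strip everything up to the last tab
    local suffix="no"
    # Quoting the pattern makes this a literal substring test, not a glob match
    [[ "${string}" == *"${pattern}"* ]] && suffix="yes"
    echo "${line} ${suffix}"
  done < "${patterns}"
}
search mess.txt.gz animals.txt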

The goal here is to limit I/O: one read from the gzipped mess.txt, one read from animals.txt, and the matching done in memory with string patterns.

edited Jun 07 '21 at 08:17

answered Jun 04 '21 at 09:17

You want `read -r` to not mangle backslashes in the input, and probably `IFS=` to preserve whitespace. But keeping the entire `gzip` file in memory seems deeply flawed, and doing this in pure shell is borderline madness anyway. – tripleee Jun 07 '21 at 05:34
@tripleee You're right, thanks, edited. That's mainly to show that it's possible, which does not imply that it's desirable ^^ – Zilog80 Jun 07 '21 at 08:09
