Whole-word matching on a body of text, given a list of words

Question

Note:

Before I get down to business, I'd like to point out some other SO posts that didn't quite answer my question and are not duplicates of this one:

Background:

I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches. This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain).

In other words, a match should be made when:

The word is all by itself on a line
The word is surrounded by non-alphanumeric/non-hyphen characters
The word is at the beginning of a line and followed by a non-alphanumeric/non-hyphen character
The word is at the end of a line and preceded by a non-alphanumeric/non-hyphen character

For example, if one of the words in words.txt is cat, I would like it to behave as follows:

cat              #=> match
cat cat cat      #=> match
the cat is gray  #=> match
mouse,cat,dog    #=> match
caterpillar cat  #=> match
caterpillar      #=> no match
concatenate      #=> no match
bobcat           #=> no match
catcat           #=> no match
cat100           #=> no match
cat-in-law       #=> no match

Previous research:

There's a grep command that almost suits my needs. It is as follows:

grep -wf words.txt file.txt

where the options are:

-w, --word-regexp
       Select only those lines containing matches that form whole words.
       The test is that the matching substring must either be at the beginning
       of the line, or preceded by a non-word constituent character.
       Similarly, it must be either at the end of the line or followed by a
       non-word constituent character. Word-constituent characters are
       letters, digits, and the underscore.
-f FILE, --file=FILE
       Obtain patterns from FILE, one per line. The empty file contains
       zero patterns, and therefore matches nothing.

The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.
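A minimal reproduction of this behaviour, assuming a grep that supports -w and -f as described above, and using a scratch directory from mktemp. The file contents are a subset of the example lines listed earlier, and the variable names are only illustrative.

```shell
# Scratch files holding one word and a few of the example lines.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' > "$tmpdir/words.txt"
printf '%s\n' 'the cat is gray' 'caterpillar' 'cat100' 'cat-in-law' > "$tmpdir/file.txt"

# -w counts "-" as a non-word character, so cat-in-law is reported too.
w_matches=$(grep -wf "$tmpdir/words.txt" "$tmpdir/file.txt")
printf '%s\n' "$w_matches"

rm -rf "$tmpdir"
```

The lines printed are "the cat is gray" and "cat-in-law". The second one is the unwanted match.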

I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed/preceded by a hyphen, then I need to treat it as if it's part of a larger word (e.g. cat-in-law) and not an instance of the word by itself.

Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:

grep -Ef words.txt file.txt

where

-E, --extended-regexp
       Interpret PATTERN as an extended regular expression

However, I would like to avoid altering words.txt and keep it free of regex patterns.

Question:

Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?

If you have a big list of constant words, use the ternary tool to generate a trie regex. It's much faster than simple alternations, and you can pick the boundary you need; see the screenshot [here](http://www.regexformat.com/default_files/Rx5_ScrnSht01.jpg). Take a look at how they made dictionaries into a single regex. The trial version is free. If it's an ever-changing (dynamic) word list, this won't work. — , May 26 '15 at 23:34

score 5 · Accepted Answer · answered May 26 '15 at 22:56

I finally came up with a solution:

grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' words.txt) file.txt

Explanation:

words.txt is my list of words (one per line).
file.txt is the body of text that I would like to search.
The awk command preprocesses words.txt on the fly, wrapping each word in a regular expression that requires either a line boundary or a character that is neither alphanumeric nor a hyphen on each side of the word (the rules listed in my question above).
The awk command is surrounded by <( and ), which is bash process substitution: its output is presented to grep as a file, so it can be used as the argument to the -f option.
I'm using the -E option because I'm now inputting a list of regular expressions instead of fixed strings from words.txt.

The nice thing here is that words.txt can remain human-readable and doesn't have to contain a bunch of regex patterns.
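As a worked check, the sketch below runs the same awk wrapper against the example lines from the question. A few things here are assumptions for illustration: the scratch directory, the variable names, and the extra word dog (added only to show that each word gets its own pattern line). The patterns are written to a temporary file instead of going through <( ) so that they can be inspected.

```shell
# Scratch copies of the sample data; "dog" is an added illustrative word.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' 'dog' > "$tmpdir/words.txt"
printf '%s\n' \
  'cat' 'cat cat cat' 'the cat is gray' 'mouse,cat,dog' 'caterpillar cat' \
  'caterpillar' 'concatenate' 'bobcat' 'catcat' 'cat100' 'cat-in-law' \
  > "$tmpdir/file.txt"

# Same awk wrapper as in the answer: one boundary-wrapped pattern per word.
patterns=$(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' "$tmpdir/words.txt")
printf '%s\n' "$patterns" > "$tmpdir/patterns.txt"
printf '%s\n' "$patterns"

# Feed the generated patterns to grep as extended regular expressions.
matches=$(grep -Ef "$tmpdir/patterns.txt" "$tmpdir/file.txt")
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

The first five sample lines are printed and the remaining six are not, which agrees with the match/no match table in the question. One caveat: because each word is pasted into a regular expression unchanged, a word containing regex metacharacters such as . or * would be interpreted as a pattern, not as literal text.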

Just want to say thank you. I've wanted some solution like this from time to time, and finally found one. — CloudyTrees, Apr 11 '20 at 14:05
