Note:
Before I get down to business, I'd like to point out some other SO posts that didn't quite answer my question and are not duplicates of this one:
- How to grep with a list of words
- How to make grep only match if the entire line matches?
- how to grep for the whole word
- Grep extract only whole word
Background:
I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches. This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain).
In other words, a match should be made when:
- The word is all by itself on a line
- The word is surrounded by non-alphanumeric/non-hyphen characters
- The word is at the beginning of a line and followed by a non-alphanumeric/non-hyphen character
- The word is at the end of a line and preceded by a non-alphanumeric/non-hyphen character
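The four rules above can be sketched as a single extended regular expression for one word. This is only an illustration under my own assumptions: the function name is_whole_word_match is hypothetical, and the word is assumed to contain no regex metacharacters.

```shell
# Hypothetical helper: succeeds if "$2" (a line) contains "$1" (a word)
# standing by itself, where letters, digits, and hyphens all count as
# part of a word. Assumes the word has no regex metacharacters.
is_whole_word_match() {
    word=$1
    line=$2
    # Left side: start of line or a non-alphanumeric, non-hyphen character.
    # Right side: end of line or a non-alphanumeric, non-hyphen character.
    printf '%s\n' "$line" | grep -Eq "(^|[^[:alnum:]-])${word}(\$|[^[:alnum:]-])"
}

is_whole_word_match cat 'mouse,cat,dog' && echo 'match'
is_whole_word_match cat 'cat-in-law' || echo 'no match'
```

The hyphen is placed last inside the bracket expression so that it is taken literally rather than as a range.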
For example, if one of the words in words.txt is cat, I would like it to behave as follows:
cat #=> match
cat cat cat #=> match
the cat is gray #=> match
mouse,cat,dog #=> match
caterpillar cat #=> match
caterpillar #=> no match
concatenate #=> no match
bobcat #=> no match
catcat #=> no match
cat100 #=> no match
cat-in-law #=> no match
Previous research:
There's a grep command that almost suits my needs. It is as follows:
grep -wf words.txt file.txt
where the options are:
-w, --word-regexp
Select only those lines containing matches that form whole words.
The test is that the matching substring must either be at the beginning
of the line, or preceded by a non-word constituent character.
Similarly, it must be either at the end of the line or followed by a
non-word constituent character. Word-constituent characters are
letters, digits, and the underscore.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains
zero patterns, and therefore matches nothing.
The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.
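A small reproduction of the problem, using throwaway copies of the two files in a scratch directory (the sample lines are taken from the examples above; the variable names are my own):

```shell
# Scratch directory so the real words.txt and file.txt are untouched.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' > "$tmpdir/words.txt"
printf '%s\n' 'the cat is gray' 'cat-in-law' 'bobcat' > "$tmpdir/file.txt"

# -w treats "-" as a non-word character, so cat-in-law is reported
# alongside the line that should match.
w_result=$(grep -wf "$tmpdir/words.txt" "$tmpdir/file.txt")
printf '%s\n' "$w_result"
# prints:
#   the cat is gray
#   cat-in-law

rm -rf "$tmpdir"
```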
I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed/preceded by a hyphen, then I need to treat it as if it's part of a larger word (e.g. cat-in-law) and not an instance of the word by itself.
Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:
grep -Ef words.txt file.txt
where the option is:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression
However, I would like to avoid altering words.txt and keep it free of regex patterns.
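One direction that might satisfy this constraint is to generate the regex patterns on the fly and leave words.txt as it is. The following is a rough sketch, not a tested solution for large inputs: the intermediate file patterns.txt is a name I made up, and it assumes the words contain no regex metacharacters and that words.txt has no empty lines.

```shell
# Scratch copies so the real files are untouched.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' 'dog' > "$tmpdir/words.txt"
printf '%s\n' 'cat' 'mouse,cat,dog' 'caterpillar cat' 'bobcat' \
    'cat100' 'cat-in-law' 'hot-dog' > "$tmpdir/file.txt"

# Wrap every word in the boundary pattern; words.txt is only read.
# In the sed replacement, & stands for the matched word.
sed 's/.*/(^|[^[:alnum:]-])&($|[^[:alnum:]-])/' "$tmpdir/words.txt" \
    > "$tmpdir/patterns.txt"

e_result=$(grep -Ef "$tmpdir/patterns.txt" "$tmpdir/file.txt")
printf '%s\n' "$e_result"
# prints:
#   cat
#   mouse,cat,dog
#   caterpillar cat

rm -rf "$tmpdir"
```

In bash, the intermediate file could presumably be replaced by process substitution, i.e. passing <(sed ...) as the argument to -f.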
Question:
Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?