How to count the number of lines in a file that each pattern of a group appears in?

Question

I am having problems when trying to count the number of times a specific pattern appears in a file (let's call it B). In this case, I have a file with 30 patterns (let's call it A), and I want to know, for each pattern, how many lines of B contain it.

With only one pattern it is quite simple:

grep "pattern" file | wc -l

But with a file full of them I am not able to figure out how it may work. I already tried this:

grep -f "fileA" "fileB" | wc -l

However, this gives me the total number of lines that match any of the patterns, not a separate count for each pattern, which is what I want.

Thank you so much.

Never use the word `pattern` when discussing matching text as it's highly ambiguous and likely to get you the wrong answer, always at a minimum state whatever combination of regexp-or-string, full-or-partial, word-or-line matching you need instead. See https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern for details then [edit] your question to replace `pattern` everywhere it occurs with the appropriate regexp/string + full/partial + word/line definition of what you're trying to match so we can help you. — Ed Morton, Nov 09 '21 at 22:40

Socowi · Answer 1 · 2021-11-09T20:26:33.950

Count matches per literal string

If you simply want to know how often each pattern appears and each of your patterns is a fixed string (not a regex), use ...

grep -oFf needles.txt haystack.txt | sort | uniq -c
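As a rough illustration, here is that command run on made-up sample data. The file contents and the temporary directory are assumptions for the demo, not taken from the question.

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 'foo' 'bar' > needles.txt
printf '%s\n' 'foo bar foo' 'bar' 'baz' > haystack.txt

# -o prints every match on its own line, -F treats each needle as a fixed string,
# -f reads the needles from a file; sort groups equal matches for uniq -c
match_counts=$(grep -oFf needles.txt haystack.txt | sort | uniq -c)
printf '%s\n' "$match_counts"
# counts: 2 for bar, 2 for foo
```

The first line of haystack.txt contributes two matches for foo, so this counts matches, not lines.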

Count matching lines per literal string

Note that the above is slightly different from your formulation "I want to know how many lines contain that pattern", as one line can contain multiple matches. If you really need to count matching lines per pattern instead of matches per pattern, things get a little trickier:

grep -noFf needles.txt haystack.txt | sort -u | cut -d: -f2- | sort | uniq -c
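A small sketch of the per-line variant on made-up data (file contents are assumptions for the demo). Note that this sketch sorts again after cut: once the line numbers are stripped, equal strings are not necessarily adjacent, and uniq -c only merges adjacent lines.

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 'foo' 'bar' > needles.txt
printf '%s\n' 'foo bar foo' 'bar' 'baz' > haystack.txt

# -n prefixes each match with its line number ("1:foo"), so sort -u
# collapses repeated matches of the same string on the same line
line_counts=$(grep -noFf needles.txt haystack.txt \
    | sort -u \
    | cut -d: -f2- \
    | sort \
    | uniq -c)
printf '%s\n' "$line_counts"
# counts: 2 for bar, 1 for foo (foo occurs twice, but only on one line)
```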

Count matching lines per regex

If the patterns are regexes, you probably have to iterate over the patterns, as grep's output only tells you that (at least) one pattern matched, but not which one.

# this will be very slow if you have many patterns
while IFS= read -r pattern; do
    printf '%8d %s\n' "$(grep -ce "$pattern" haystack.txt)" "$pattern"
done < needles.txt

... or use a different tool/language like awk or perl.
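For the awk route, a minimal single-pass sketch could look like the following. It assumes the patterns are valid awk regexes: awk uses extended regular expressions, while grep without -E uses basic ones, so some patterns may need adjusting. The sample data is made up.

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' '^ba' 'fo*' > needles.txt
printf '%s\n' 'foo bar foo' 'bar' 'baz' > haystack.txt

regex_counts=$(awk '
    # First file: remember every regex in input order
    NR == FNR { patterns[++n] = $0; next }
    # Second file: test each line against every regex, count a line once per regex
    { for (i = 1; i <= n; i++) if ($0 ~ patterns[i]) count[i]++ }
    # Print the counts in the same layout as the shell loop
    END { for (i = 1; i <= n; i++) printf "%8d %s\n", count[i], patterns[i] }
' needles.txt haystack.txt)
printf '%s\n' "$regex_counts"
# counts: 2 for ^ba, 1 for fo*
```

This reads haystack.txt only once instead of once per pattern, although every line is still tested against every regex.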

Note on overlapping matches

You did not formulate any precise requirements, so I went with the simplest solutions for each case. The first two solutions and the last solution behave differently in case multiple patterns match (part of) the same substring.

grep -f needles.txt matches each substring at most once, so some matches might be "missed" (what counts as "missed" depends on your requirements),
whereas iterating grep -e pattern1; grep -e pattern2; ... might match the same substring multiple times.
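To make the difference concrete, here is a small sketch with two overlapping fixed strings. The needles ab and bc and the one-line haystack are assumptions chosen for the demo.

```shell
# Hypothetical sample data: both needles overlap inside "abc"
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 'ab' 'bc' > needles.txt
printf '%s\n' 'abc' > haystack.txt

# All needles at once: after "ab" is consumed, only "c" is left, so bc is never reported
combined=$(grep -oFf needles.txt haystack.txt | sort | uniq -c)

# One grep per needle: each needle is checked independently
separate=$(while IFS= read -r needle; do
    printf '%d %s\n' "$(grep -cFe "$needle" haystack.txt)" "$needle"
done < needles.txt)

printf 'combined:\n%s\nseparate:\n%s\n' "$combined" "$separate"
# combined reports only ab (count 1); separate reports ab and bc (count 1 each)
```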
