Using grep with pattern file to count individual pattern matches in a file

Question

I am using grep with a file that contains multiple search patterns. As output I would like to get each matching pattern and the number of occurrences of that specific pattern.

cat pattern.txt

AT3G09260.1
AT5G50920.1

The input file looks like this

>AT2G44750.1 | Symbols: TPK2 | thiamin pyrophosphokinase 2 | chr2:18451510-18452754 FORWARD LENGTH=265
>AT2G47140.1 | Symbols:  | NAD(P)-binding Rossmann-fold superfamily protein | chr2:19350970-19352059 REVERSE LENGTH=257
>AT2G47120.1 | Symbols:  | NAD(P)-binding Rossmann-fold superfamily protein 
>AT1G21470.1 | Symbols:  | BEST Arabidopsis thaliana protein match is: CLPC homologue 1 (TAIR:AT5G50920.1); Has 326 Blast hits to 324 proteins in 95 species: Archae - 0; Bacteria - 130; Metazoa - 0; Fungi - 0; Plants - 67; Viruses - 0; Other Eukaryotes - 129 (source: NCBI BLink). | chr1:7516709-7517179 REVERSE LENGTH=118
>AT3G09260.1 | Symbols: PYK10, PSR3.1, BGLU23, LEB | Glycosyl hydrolase superfamily protein | chr3:2840657-2843730 REVERSE LENGTH=524
>AT5G48175.1 | Symbols:  | FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: endomembrane system; EXPRESSED IN: hypocotyl, male gametophyte, root; BEST Arabidopsis thaliana protein match is: Glycosyl hydrolase superfamily protein (TAIR:AT3G09260.1); Has 30201 Blast hits to 17322 proteins in 780 species: Archae - 12; Bacteria - 1396; Metazoa - 17338; Fungi - 3422; Plants - 5037; Viruses - 0; Other Eukaryotes - 2996 (source: NCBI BLink). | chr5:19539208-19539676 FORWARD LENGTH=115
>AT5G50920.1 | Symbols: CLPC, ATHSP93-V, HSP93-V, DCA1, CLPC1 | CLPC homologue 1 | chr5:20715710-20719800 REVERSE LENGTH=929

I would like to get something like

AT3G09260.1    2
AT5G50920.1    2

I have tried

grep -f pattern.txt -c inputfile.txt
4

but it only gives me the total number of matching lines (for all the patterns). I believe the question was already asked here but never solved

how to loop over pattern from a file with grep

Thank you.

Why did you write *but never solved*? That question has been answered. – RomanPerekhrest, Oct 20 '17 at 10:21

Rahul Verma · Accepted Answer · 2017-10-20T11:24:24.620

3

You need grep -o, which prints only the matched part of each line, one match per output line. You can then count the matches with sort and uniq:

$ grep -of pattern_file input_file | sort | uniq -c
      2 AT3G09260.1
      2 AT5G50920.1

If you want the columns swapped, you can append awk:

$ grep -of pattern_file input_file | sort | uniq -c | awk '{print $2,$1}'
AT3G09260.1 2
AT5G50920.1 2
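One detail worth checking: grep treats each line of the pattern file as a regular expression by default, so the "." in an ID such as AT3G09260.1 matches any character. The sketch below adds -F so the patterns are matched as literal strings. The sample files are shortened, illustrative versions of the question's data, and the AT3G09260X1 line is made up to show the difference.

```shell
# Sketch only: sample files are written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' > "$tmp/pattern.txt"
printf '%s\n' \
  '>AT1G21470.1 | match is: CLPC homologue 1 (TAIR:AT5G50920.1)' \
  '>AT3G09260.1 | Glycosyl hydrolase superfamily protein' \
  '>AT5G48175.1 | match is: Glycosyl hydrolase (TAIR:AT3G09260.1)' \
  '>AT5G50920.1 | CLPC homologue 1' \
  '>AT3G09260X1 | made-up line, matches only if "." is a wildcard' \
  > "$tmp/input.txt"

# -F: patterns are literal strings; -o: print each match on its own line.
counts=$(grep -oFf "$tmp/pattern.txt" "$tmp/input.txt" \
  | LC_ALL=C sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$counts"
rm -rf "$tmp"
```

With -F the made-up AT3G09260X1 line is not counted, so both IDs are reported twice.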

Or use awk alone:

$ awk 'FNR==NR{a[$1]=0; next} { for(i in a) {a[i]+=gsub(i,"")} } END{for(i in a){ print i, a[i]} }' pattern_file RS= input_file
AT5G50920.1 2
AT3G09260.1 2
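The awk command above passes each pattern to gsub, which treats it as a regular expression. As a hedged variation (not the original answer's code), the sketch below counts literal occurrences with index() and also reports patterns that never occur. The sample data is shortened from the question, and AT1G99999.1 is a made-up ID added to show a zero count.

```shell
# Sketch only: sample files are written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' 'AT1G99999.1' > "$tmp/pattern.txt"
printf '%s\n' \
  '>AT1G21470.1 | match is: CLPC homologue 1 (TAIR:AT5G50920.1)' \
  '>AT3G09260.1 | Glycosyl hydrolase superfamily protein' \
  '>AT5G48175.1 | match is: Glycosyl hydrolase (TAIR:AT3G09260.1)' \
  '>AT5G50920.1 | CLPC homologue 1' \
  > "$tmp/input.txt"

counts=$(awk '
  # First file: remember every non-empty pattern with a count of zero.
  FNR == NR { if ($1 != "") count[$1] = 0; next }
  # Second file: count literal occurrences of each pattern in the line.
  {
    for (p in count) {
      rest = $0
      # index() is a plain substring search, so "." is not a wildcard.
      while ((pos = index(rest, p)) > 0) {
        count[p]++
        rest = substr(rest, pos + length(p))
      }
    }
  }
  END { for (p in count) print p, count[p] }
' "$tmp/pattern.txt" "$tmp/input.txt" | LC_ALL=C sort)
printf '%s\n' "$counts"
rm -rf "$tmp"
```

The final sort is there because awk's "for (p in count)" does not guarantee any particular order.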

edited Oct 20 '17 at 11:24

answered Oct 20 '17 at 11:01

Rahul Verma

1

Great, both grep and awk working perfectly. Thank you. – marie Oct 20 '17 at 12:24

score 0 · Answer 2 · answered Oct 20 '17 at 10:21


The following awk may help. Your Input_file does not appear to contain any repeated lines, so I could not test it against your expected output.

awk '{a[$0]++} END{for(i in a){print i,a[i]}}'  Input_file

answered Oct 20 '17 at 10:21

RavinderSingh13

user1717259 · Answer 3 · 2017-10-20T10:57:34.923

0

Try

grep -f pattern.txt inputfile.txt| cut -d'|' -f1 |sort | uniq -c

This will grep the matching lines from your file, extract the ID (everything before the first pipe symbol), sort the IDs, and then count the occurrences of each unique ID.
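To see what this pipeline tallies, here is a small sketch on shortened, illustrative versions of the question's lines. It counts the header ID of each matching line, which is not necessarily the pattern that matched.

```shell
# Sketch only: sample files are written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' > "$tmp/pattern.txt"
printf '%s\n' \
  '>AT2G44750.1 | thiamin pyrophosphokinase 2' \
  '>AT1G21470.1 | match is: CLPC homologue 1 (TAIR:AT5G50920.1)' \
  '>AT3G09260.1 | Glycosyl hydrolase superfamily protein' \
  '>AT5G48175.1 | match is: Glycosyl hydrolase (TAIR:AT3G09260.1)' \
  '>AT5G50920.1 | CLPC homologue 1' \
  > "$tmp/input.txt"

# Keep matching lines, cut out the first "|"-separated field, count each ID.
# The trailing awk only swaps the columns and drops the padding.
ids=$(grep -f "$tmp/pattern.txt" "$tmp/input.txt" \
  | cut -d'|' -f1 | LC_ALL=C sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$ids"
rm -rf "$tmp"
```

On this sample each of the four matching headers is listed once, including AT1G21470.1 and AT5G48175.1, which only mention a pattern in their description.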

edited Oct 20 '17 at 10:57

answered Oct 20 '17 at 10:26

user1717259

grep returns the entire line which can't be sorted easily so it won't work - will edit my question with the details of the input file to make it clearer – marie Oct 20 '17 at 10:41
