I am using grep with a file that contains multiple search patterns. As output, I would like to get each matching pattern together with the number of occurrences of that specific pattern.

cat pattern.txt

AT3G09260.1
AT5G50920.1

The input file looks like this

>AT2G44750.1 | Symbols: TPK2 | thiamin pyrophosphokinase 2 | chr2:18451510-18452754 FORWARD LENGTH=265
>AT2G47140.1 | Symbols:  | NAD(P)-binding Rossmann-fold superfamily protein | chr2:19350970-19352059 REVERSE LENGTH=257
>AT2G47120.1 | Symbols:  | NAD(P)-binding Rossmann-fold superfamily protein 
>AT1G21470.1 | Symbols:  | BEST Arabidopsis thaliana protein match is: CLPC homologue 1 (TAIR:AT5G50920.1); Has 326 Blast hits to 324 proteins in 95 species: Archae - 0; Bacteria - 130; Metazoa - 0; Fungi - 0; Plants - 67; Viruses - 0; Other Eukaryotes - 129 (source: NCBI BLink). | chr1:7516709-7517179 REVERSE LENGTH=118
>AT3G09260.1 | Symbols: PYK10, PSR3.1, BGLU23, LEB | Glycosyl hydrolase superfamily protein | chr3:2840657-2843730 REVERSE LENGTH=524
>AT5G48175.1 | Symbols:  | FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: endomembrane system; EXPRESSED IN: hypocotyl, male gametophyte, root; BEST Arabidopsis thaliana protein match is: Glycosyl hydrolase superfamily protein (TAIR:AT3G09260.1); Has 30201 Blast hits to 17322 proteins in 780 species: Archae - 12; Bacteria - 1396; Metazoa - 17338; Fungi - 3422; Plants - 5037; Viruses - 0; Other Eukaryotes - 2996 (source: NCBI BLink). | chr5:19539208-19539676 FORWARD LENGTH=115
>AT5G50920.1 | Symbols: CLPC, ATHSP93-V, HSP93-V, DCA1, CLPC1 | CLPC homologue 1 | chr5:20715710-20719800 REVERSE LENGTH=929

I would like to get something like

AT3G09260.1    2
AT5G50920.1    2

I have tried

grep -f pattern.txt -c inputfile.txt
4

but it only gives me the total number of matching lines (for all patterns combined). I believe the question was already asked here but never solved:

how to loop over pattern from a file with grep

Thank you.

marie

3 Answers

You need grep -o, which prints only the matched part of each line, one match per output line. You can then count the matches per pattern with sort and uniq -c:

$ grep -of pattern_file input_file | sort | uniq -c
      2 AT3G09260.1
      2 AT5G50920.1

To print the pattern first and the count second, swap the columns with awk:

$ grep -of pattern_file input_file | sort | uniq -c | awk '{print $2,$1}'
AT3G09260.1 2
AT5G50920.1 2
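
One caveat with the pipeline above: grep treats each line of the pattern file as a regular expression, so the "." in an ID matches any character. A minimal sketch, assuming the IDs should be matched literally, adds -F for fixed-string matching. The scratch directory, file names, and trimmed sample lines are hypothetical and only make the example self-contained.

```shell
# Hypothetical scratch files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' > "$tmpdir/pattern.txt"
cat > "$tmpdir/input.txt" <<'EOF'
>AT1G21470.1 | CLPC homologue 1 (TAIR:AT5G50920.1)
>AT3G09260.1 | Glycosyl hydrolase superfamily protein
>AT5G48175.1 | Glycosyl hydrolase superfamily protein (TAIR:AT3G09260.1)
>AT5G50920.1 | CLPC homologue 1
EOF

# -o prints every match on its own line, -F treats each pattern as a
# literal string, -f reads the patterns from a file.
# sort groups identical IDs so that uniq -c can count them;
# awk then swaps the columns to "ID count".
counts=$(grep -oFf "$tmpdir/pattern.txt" "$tmpdir/input.txt" \
    | sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$counts"

rm -rf "$tmpdir"
```

Note that a pattern with no matches at all does not appear in this output, because grep never prints it.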

Or use awk alone (gsub returns the number of substitutions it made, which serves as the match count):

$ awk 'FNR==NR{a[$1]=0; next} { for(i in a) {a[i]+=gsub(i,"")} } END{for(i in a){ print i, a[i]} }' pattern_file RS= input_file
AT5G50920.1 2
AT3G09260.1 2
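
The awk command above passes each pattern to gsub, which interprets it as a regular expression. A possible variant, sketched here under the assumption that literal matching is wanted, counts occurrences with index() instead and also reports patterns that never match. The ID AT9G99999.1, the file names, and the trimmed sample lines are hypothetical and exist only for illustration.

```shell
# Hypothetical scratch files; AT9G99999.1 is a made-up ID with no matches.
tmpdir=$(mktemp -d)
printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' 'AT9G99999.1' > "$tmpdir/pattern.txt"
cat > "$tmpdir/input.txt" <<'EOF'
>AT1G21470.1 | CLPC homologue 1 (TAIR:AT5G50920.1)
>AT3G09260.1 | Glycosyl hydrolase superfamily protein
>AT5G48175.1 | Glycosyl hydrolase superfamily protein (TAIR:AT3G09260.1)
>AT5G50920.1 | CLPC homologue 1
EOF

result=$(awk '
    # First file: register every non-empty pattern with a count of 0.
    NR == FNR { if (NF) count[$1] = 0; next }
    # Second file: count literal, non-overlapping occurrences per line.
    {
        for (p in count) {
            rest = $0
            while ((pos = index(rest, p)) > 0) {
                count[p]++
                rest = substr(rest, pos + length(p))
            }
        }
    }
    # for-in order is unspecified in awk, so the output is sorted afterwards.
    END { for (p in count) print p, count[p] }
' "$tmpdir/pattern.txt" "$tmpdir/input.txt" | sort)
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Because every pattern is initialised to 0 before the input is read, IDs without any hit still show up in the result.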
Rahul Verma

The following awk command may help. It counts how many times each whole line occurs. Your sample input does not appear to contain any repeated lines, so I could not test it against your expected output.

awk '{a[$0]++} END{for(i in a){print i,a[i]}}'  Input_file
RavinderSingh13

Try

grep -f pattern.txt inputfile.txt| cut -d'|' -f1 |sort | uniq -c

This greps the matching lines from your file, extracts the ID (everything before the first pipe symbol), sorts the IDs, and then counts the occurrences of each.

user1717259
  • grep returns the entire line which can't be sorted easily so it won't work - will edit my question with the details of the input file to make it clearer – marie Oct 20 '17 at 10:41