
I want to grep with patterns from a file containing regexes. When a pattern matches, grep prints the matched string but not the pattern. How can I get the pattern instead of the matched strings?

pattern.txt

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate
Donut Gorilla Chocolate
Chocolate (English|Fall) apple gorilla
gorilla chocolate (apple|ball)
(ball|donut) apple

strings.txt

apple ball Donut
donut ball chocolate
donut Ball Chocolate
apple donut
chocolate ball Apple

This is the grep command:

grep -Eix -f pattern.txt strings.txt

This command prints the matched strings from strings.txt:

apple ball Donut
donut ball chocolate
donut Ball Chocolate

But I want to find which patterns from pattern.txt produced the matches:

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

The lines in pattern.txt can be lower case or upper case, with or without regex, and can contain any number of words and regex elements. The only regex constructs used are parentheses and the pipe.

I don't want to loop over pattern.txt and run grep for each line, as that is slow. Is there a way to make grep print which pattern, or which line number of the pattern file, matched? Or is there another command that can do the job without being too slow?

RavinderSingh13
haru

4 Answers


I have no idea how to do this with grep, but with GNU awk:

$ awk '
BEGIN { IGNORECASE = 1 }      # for case insensitivity
NR==FNR {                     # process pattern file
    a[$0]                     # hash the entries to a
    next                      # process next line
}
{                             # process strings file
    for(i in a)               # loop all pattern file entries
        if($0 ~ "^" i "$") {  # if there is a match (see comments)
            print i           # output the matching pattern file entry
            # delete a[i]     # uncomment to delete matched patterns from a
            # next            # uncomment to end searching after first match
        }
}' pattern strings

outputs:

D (A|B) C

For each line in strings, the script loops over every pattern line to see whether more than one pattern matches. With the sample data there is only one match because of case sensitivity; you can work around that with, for example, GNU awk's IGNORECASE.

Also, if you want each matched pattern file entry to be output only once, you could delete it from a after its first match: add delete a[i] after the print. That might also give you some performance advantage.
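
Since the question also asks for the line number of the pattern, the same idea can be sketched in a more portable form. This is only a sketch under assumptions: the function name `match_patterns` and the `N:pattern` output format are made up for illustration, and case-insensitivity is done with `tolower()` instead of `IGNORECASE`, which is safe only because the patterns are said to contain nothing but words, parentheses and pipes. Each regex is wrapped as `^( ... )$` so that a top-level `|` would still be anchored on both sides.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer above): print "N:PATTERN" for every
# pattern that matches a whole line of the strings file, each pattern once.
# Usage: match_patterns PATTERN_FILE STRINGS_FILE
match_patterns() {
    awk '
    NR == FNR {                          # first file: the patterns
        pat[++n] = $0                    # original text, indexed by line number
        re[n] = "^(" tolower($0) ")$"    # anchored, lower-cased regex
        next
    }
    {                                    # second file: the strings
        line = tolower($0)
        for (i = 1; i <= n; i++)
            if (!(i in seen) && line ~ re[i]) {
                print i ":" pat[i]
                seen[i]                  # report each pattern only once
            }
    }' "$1" "$2"
}
```

Called as `match_patterns pattern.txt strings.txt`, it reports patterns in the order in which they first match a string, because the patterns are kept in a numerically indexed array rather than visited with `for (i in a)`, whose order is unspecified.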

James Brown
  • This code works, but it also matches partial strings. If I change "$0 ~ i" to "$0 == i" it matches only the entire string, but then there is a problem matching escaped characters... How can I write something like "=~ ^pattern$" in awk? – haru Aug 13 '18 at 14:25
  • Thanks, now it does what I need; this is about twice as fast as calling grep in a loop! – haru Aug 13 '18 at 14:34
  • You're welcome. I added the previous and `IGNORECASE` to the code. – James Brown Aug 13 '18 at 14:37

EDIT: Since the OP changed the Input_file(s), I am adding a solution for the changed Input_file(s) as well.

awk '
FNR==NR{
   a[toupper($1),toupper($NF)]
   b[toupper($2)]
   next
}
{
   val=toupper($2)
   gsub(/\)|\(|\|/," ",val)
   num=split(val,array," ")
   for(i=1;i<=num;i++){
      if(array[i] in b){
        flag=1
        break
      }
   }
}
flag && ((toupper($1),toupper($NF)) in a){
  print;
  flag=""
}' string pattern

Output will be as follows.

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate


Solution 1st: A generic solution for the case where your Input_file named pattern has more than 2 values in its 2nd field, e.g. (B|C|D|E); the following may help you here.

awk '
FNR==NR{
   a[$1,$NF]
   b[toupper($2)]
   next
}
{
   val=$2
   gsub(/\)|\(|\|/," ",val)
   num=split(val,array," ")
   for(i=1;i<=num;i++){
      if(array[i] in b){
        flag=1
        break
      }
   }
}
flag && (($1,$NF) in a)
{
  flag=""
}' string pattern


Solution 2nd: Could you please try the following. It strictly assumes that your Input_file(s) follow the same pattern as the shown samples (that is, that your Input_file named pattern has only 2 values in its 2nd field).

awk '
FNR==NR{
  a[$1,$NF]
  b[toupper($2)]
  next
}
{
  val=$2
  gsub(/\)|\(|\|/," ",val)
  split(val,array," ")
}
((array[1] in b) || (array[2] in b)) && (($1,$NF) in a)
' string pattern

Output will be as follows.

A (B|C) D
D (A|B) C
RavinderSingh13
  • This is such a high-performance solution, but my example was bad; the actual file is more flexible: it can have a varying number of words, lower and upper case, lines without regex, escaped literal brackets, other symbols, etc. – haru Aug 13 '18 at 14:00
  • None of the answers works yet with my actual files, which contain more complex text as explained above. I am trying to modify your code but I'm still not getting the right result. Please let me edit the example. – haru Aug 13 '18 at 14:11
  • @haruka.k, please check my EDIT solution and let me know. The previous one would have worked too; the only problem was with lower-case and capital letters. Check now, and also try to up-vote people who are trying to help you and select one answer as correct once you find it. Cheers – RavinderSingh13 Aug 13 '18 at 14:32
  • This code still doesn't work on mine and I'm having trouble understanding the details... When you split val by space into array, what happens if there is only one word in the line? – haru Aug 13 '18 at 15:04
  • @haruka.k, please post an example here of what you mean by 1 word. – RavinderSingh13 Aug 13 '18 at 15:50
  • @haruka.k, I can see my code works perfectly fine with your given example; please let me know what is not working and I will try to fix it. – RavinderSingh13 Aug 13 '18 at 15:55
  • I have now found that my work pattern file contains regexes without spaces, like pattern "aaa(bbb|ccc)ddd" and string "aaabbbddd". Is the space necessary in your code? Could you possibly comment your code a bit to help me understand what it is doing? – haru Aug 13 '18 at 16:46
  • @haruka.k, yes, the space is necessary in my code (the default field separator in `awk` is whitespace); I am relying on that. I will add an explanation after dinner. Do you have a file without spaces too? – RavinderSingh13 Aug 13 '18 at 16:50

You could try with bash built-ins:

$ cat foo.sh
#!/usr/bin/env bash

# case insensitive
shopt -s nocasematch

# associative array of patterns
declare -A patterns=()
while read -r p; do
    patterns["$p"]=1
done < pattern.txt

# read strings, test remaining patterns,
# if match print pattern and remove it from array    
while read -r s; do
    for p in "${!patterns[@]}"; do
        if [[ $s =~ ^$p$ ]]; then
            printf "%s\n" "$p"
            unset patterns["$p"]
        fi
    done
done < strings.txt
$ ./foo.sh
Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

Not sure about the performance but as there are no child processes, it should be much faster than invoking grep for each pattern.

Of course, if you have millions of patterns, storing them in an associative array could exhaust your available memory.

Renaud Pacalet

Maybe switch the paradigm?

while IFS= read -r pat
do grep -Eix "$pat" strings.txt >"$pat" &
done <pattern.txt
wait   # let the background greps finish

That's going to make ugly filenames, but you'd have clear lists per set. You could scrub the filenames first if you prefer. Maybe (assuming the patterns resolve to uniqueness this easily...)

while IFS= read -r pat
do grep -Eix "$pat" strings.txt >"${pat//[^A-Z]/}" &
done <pattern.txt
wait   # let the background greps finish

It ought to be reasonably quick, and is relatively simple to implement. Hope that helps.
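
One way to avoid scrubbing the patterns into filenames at all is to name each result file after the pattern's line number. The following is a rough sketch of that variation, not part of the approach above as written: the function name `split_matches` and the `match.N` file naming are invented for illustration, and it assumes the output directory already exists.

```shell
#!/bin/sh
# Hypothetical helper: run one background grep per pattern and store the
# matching strings in OUTDIR/match.N, where N is the pattern's line number.
# Usage: split_matches PATTERN_FILE STRINGS_FILE OUTDIR
split_matches() {
    patterns=$1
    strings=$2
    outdir=$3
    n=0
    while IFS= read -r pat; do
        n=$((n + 1))
        grep -Eix -e "$pat" "$strings" > "$outdir/match.$n" &
    done < "$patterns"
    wait                                 # let all background greps finish
    for f in "$outdir"/match.*; do       # keep only patterns that matched
        [ -s "$f" ] || rm -f "$f"
    done
}
```

After the call, the surviving `match.N` files tell you which pattern lines matched, and each file lists the strings that pattern matched. It still starts one grep per pattern, so it trades process overhead for parallelism.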

Paul Hodges