
I've seen this question: Regular expression to match a line that doesn't contain a word?

But I can't get it to work. I have a shell script and I'm using

string1.*string2.*string3

to search for three words in a file, in that order. I want to change it so that if badword5 appears anywhere between those words in the file, grep reports no match.

So this should match:

./testing/test.txt:   let prep = "select string1, dog from cat",
          " where apple = 1",
          " and string2 = 2",
          " and grass = 8",
          " and string3 = ?"

But this should not:

./testing/test.txt:   let prep = "select string1, dog from cat",
          " where apple = 1",
          " and string2 = 2",
          " and grass = 8",
          " and badword5 = 4", 
          " and string3 = ?"

I unsuccessfully tried:

string1((?!badword5)|.)*string2((?!badword5)|.)*string3

The entire script:

find . -name "$file_to_check" 2>/dev/null | while read -r FILE
do
   tr '\n' ' ' <"$FILE" | if grep -q "string1.*string2.*string3"; then echo "$FILE" ; fi
done >> "$grep_out"
user1472747

2 Answers

1

"To search for 3 words in a file, in that order. But I want to change it so that if badword5 is anywhere in between those words in that file, there is no regex match with grep."

Indeed, and the search pattern stretches across multiple lines.
Let's drop grep for the moment and try something different:

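#!/bin/bash

find . -name "$file_to_check" 2>/dev/null | while read -r FILE
do
    # SCORE tracks how many of the wanted words have been seen so far, in order.
    SCORE=0
    # Turn the file into a word list (one word per line) and scan it.
    # The patterns match whole words exactly; use e.g. *word1* if the word
    # may have punctuation attached.
    tr ' ' '\n' <"$FILE" | while read -r WORD
    do
        case $WORD in
        "word1"    ) [ "$SCORE" = 0 ] && SCORE=1               ;;
        "word2"    ) [ "$SCORE" = 1 ] && SCORE=2               ;;
        "word3"    ) [ "$SCORE" = 2 ] && echo "$FILE" && break ;;
        "badword5" ) SCORE=0                                   ;;
        esac
    done
done >grep_out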

The case lines do the following:

"word1" ) [ "$SCORE" = 0 ] && SCORE=1 ;;
when word1 is found and SCORE is equal to 0, make SCORE equal to 1
when word2 is found and SCORE is equal to 1, make SCORE equal to 2
when word3 is found and SCORE is equal to 2, print the filename and break out of the inner loop
when badword5 is found, reset SCORE to 0, so the sequence has to start again from word1
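
The scoring logic above can also be packaged as a function that takes the words as arguments. This is only a sketch, not part of the original answer: the function name `ordered_without_bad` and its argument order are made up for illustration, and it matches each word as a substring of a token (so `string1,` counts as `string1`), which is an assumption about what the sample data needs.

```shell
#!/bin/bash
# Hypothetical helper (not from the answer): reads text on stdin and succeeds
# if w1, w2, w3 appear in that order with no "bad" word in between.
# usage: ordered_without_bad w1 w2 w3 bad < file
ordered_without_bad() {
    local w1=$1 w2=$2 w3=$3 bad=$4
    # One token per line; the subshell's exit status becomes the function's.
    tr -s '[:space:]' '\n' | (
        score=0
        while read -r word || [ -n "$word" ]; do
            case $word in
                *"$bad"*) score=0 ;;                        # start over
                *"$w1"*)  [ "$score" -eq 0 ] && score=1 ;;
                *"$w2"*)  [ "$score" -eq 1 ] && score=2 ;;
                *"$w3"*)  [ "$score" -eq 2 ] && exit 0 ;;   # full sequence seen
            esac
        done
        exit 1
    )
}
```

Inside the find loop it could be called as `ordered_without_bad string1 string2 string3 badword5 < "$FILE" && echo "$FILE"`.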
thom
  • If this works for you, then I can make one that accepts all the words on the command line, including "!bad" ones. – thom Dec 02 '13 at 18:44
  • Nice approach! I'm trying it now, it's taking a long time though. I expect it to, since the filesystem is large. I think I can manage to set up the command line arguments myself, though I appreciate the offer. I'm just totally stuck on this search thing. – user1472747 Dec 02 '13 at 18:49
  • When I have to do a fast search on a large filesystem, I always search in stages: first a recursive grep for one of the words, which shortens the list of files a lot. Let's see if I can think of something faster. If I come up with something, I will add it as another answer. – thom Dec 02 '13 at 18:57
  • I think the speed of this is alright, actually, but it doesn't seem to be working. Also, are the args to your 'tr' command backwards? I tried it both ways, but you had them the other way around in the other thread, and that worked perfectly before. – user1472747 Dec 02 '13 at 19:15
  • No, the args are not backwards. This time I replaced every space with a newline to transform the text into a "word list" (one word per line). I will take another look at this script. Either I made an error or I didn't fully understand the question. – thom Dec 02 '13 at 19:21
  • Can you explain what each case is doing with the square brackets and the double ampersand? I think I understand what you're doing here, I just don't understand the syntax. – user1472747 Dec 02 '13 at 19:21
  • I updated my answer with an explanation :-) The square brackets are an evaluation (compare), the '&&' means: if previous argument was ok then do next argument. – thom Dec 02 '13 at 19:30
  • Btw, I found the error. I made a typo: `/dir1/null` instead of `/dev/null`. I updated the answer, it should be OK now. – thom Dec 02 '13 at 19:40
  • Got it! The problem was that $SCORE was never initialized to 0. It worked with my small test file - now I'm trying the entire filesystem. I will report back when it's done. – user1472747 Dec 02 '13 at 20:25
  • It took about 3 days to scan ~1000 files each at 10k lines long. I'll try your method of searching for general keywords before doing the scoring. Thanks for the help! – user1472747 Dec 06 '13 at 13:07
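
The staged search mentioned in the comments could be sketched roughly as follows. This is an illustration rather than what either commenter actually ran: the function name `prefilter` is hypothetical, and it assumes GNU grep and GNU xargs (`-Z`, `-0`, `-r`). It only checks that all three words are present somewhere in a file, not their order, so the slower ordered check still has to run on whatever it prints.

```shell
#!/bin/bash
# Hypothetical helper: list files under a directory that contain all three
# words anywhere, so the expensive ordered scan runs on far fewer files.
# usage: prefilter DIR WORD1 WORD2 WORD3
prefilter() {
    # -r recurse, -l print only file names, -F fixed strings,
    # -Z terminate names with NUL so odd file names survive xargs -0.
    grep -rlZF -- "$2" "$1" 2>/dev/null |
        xargs -0 -r grep -lZF -- "$3" |
        xargs -0 -r grep -lF -- "$4"
}
```

Each stage only reads the files that survived the previous one, so most files are opened once and discarded.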
0

You can use grep -v to drop any line that contains badword5:

tr '\n' ' ' < "$FILE" | grep -v 'badword5' | if grep -q "string1.*string2.*string3"; then echo "$FILE" ; fi
anubhava
  • I thought that using -v wouldn't work, because our search is spanning multiple lines, plus, it would search the entire file for badword5, not just within our pattern. I could be wrong. – user1472747 Dec 02 '13 at 18:36
  • @anubhava: you probably mean `grep -v 'badword5' ||` ... Which wouldn't execute further commands. As far as I understand, any occurrence of badword5 would then prevent a match, whether or not it lies between the words. – Valentin H Dec 02 '13 at 20:48
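
As the comments point out, once tr has joined the file into a single line, `grep -v` discards that whole line if badword5 occurs anywhere in it. One alternative that stays with a regex is a negative lookahead applied at every character between the words. The sketch below assumes GNU grep built with PCRE support (`-P`); plain grep and `grep -E` have no lookahead. The function name `matches_without_bad` is made up for illustration. Note that the pattern tried in the question, `((?!badword5)|.)*`, does not exclude anything, because the `.` alternative can still consume the characters of badword5.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the file contains string1 ... string2 ...
# string3 in order, with no badword5 between them.
# Assumes GNU grep with -P (PCRE) support.
matches_without_bad() {
    # Flatten the file to one line, then let each "." advance only where
    # badword5 does not start at that position.
    tr '\n' ' ' < "$1" |
        grep -qP 'string1(?:(?!badword5).)*string2(?:(?!badword5).)*string3'
}
```

In the original loop this would replace the tr/grep pipeline: `matches_without_bad "$FILE" && echo "$FILE"`. A badword5 before string1 or after string3 does not block the match.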