
I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:

...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

In the above example, we could have regexes in place of strings.

For example, the following code checks if any of my strings exists in the file:

if grep -Eq "string1|string2|string3" file; then
  # there is at least one match
fi

How can we check that all of them exist? Since we are just interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.

Is it possible to do it without having to invoke grep multiple times (which won't scale when the input file is large or when we have a large number of strings to match) or using a tool like awk or python?

Also, is there a solution for strings that can easily be extended for regexes?

Ed Morton
codeforester
  • Duplicate question: https://unix.stackexchange.com/questions/55359/how-to-run-grep-with-multiple-and-patterns – Ian McGowan Apr 10 '18 at 21:13
  • @IanMcGowan: that's not a dupe. That post talks about all the patterns occurring together on the same line. My interest is to find them on the same line or on different lines. – codeforester Apr 11 '18 at 06:39
  • Ah, good point, my bad. I should have added a question mark. – Ian McGowan Apr 11 '18 at 18:48
  • Do you have some typical, practical numbers in mind like numbers of patterns (4, 16, 64, ...) and files to search (10, 1000, 100000)? – user unknown Apr 13 '18 at 11:28
  • @userunknown: up to a few thousand patterns is reasonable. – codeforester Apr 13 '18 at 21:36
  • @codeforester: Wow, a few thousand; that seems like important information. For such big numbers, it might get problematic to pass them as arguments via the shell. Maybe it is possible to split them into chunks of some hundreds of parameters, filter a filelist and get a filtered filelist from the first invocation, to produce a reduced filelist as input for the second invocation with the next chunk, and so on. Instead of passing the files and patterns as parameters, one could pass filenames containing them; with few modifications this would be possible with my script, but surely with other scripts too. – user unknown Apr 13 '18 at 21:59
  • Related: [How to find patterns across multiple lines using grep?](https://stackoverflow.com/q/2686147/55075) & [How can I search for a multiline pattern in a file?](https://stackoverflow.com/q/152708/55075) – kenorb Apr 14 '18 at 23:39
  • I'm completely aghast at how many times I see the word "pattern" in this Q&A and all of the answers that search for regexps using the word "string" instead. Saying you want to find a "pattern" is like losing your dog and putting up flyers that just say `"lost: 1 animal"` and saying you're working with a string when it's really a regexp is like handing someone a cat and trying to convince them it's really their missing dog! – Ed Morton Apr 15 '18 at 19:38
  • Why does this have an `awk` and `python` tag if it asks about doing it without them? – Gert van den Berg Apr 16 '18 at 10:32

21 Answers


Awk is the tool that the guys who invented grep, shell, etc. created to do general text-manipulation jobs like this, so I'm not sure why you'd want to try to avoid it.

In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

The above will stop reading the file as soon as all strings have matched.
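As a quick smoke test of the script above (the file demo.txt and its contents are invented for this sketch):

```shell
# Hypothetical demo: demo.txt and its contents are made up for illustration.
cat > demo.txt <<'EOF'
foo string1 bar
string3
string2 baz
EOF

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' demo.txt && echo "all strings found"
```

The same invocation with a string that never appears (e.g. string4) exits 1 and prints nothing.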

If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it were strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time, whereas the first awk script above works in any awk, in any shell, on any UNIX box, and stores only one line of input at a time.

I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters then you can get the same effect as above by using RS='\0' instead of RS='^$' or by appending to variable one line at a time as it's read and then processing that variable in the END section.
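The "appending to a variable" variant mentioned above might look like this minimal sketch (file names invented), which should work in any POSIX awk; note that, unlike the early-exit versions, it always reads the whole searched file:

```shell
# Sketch of the portable variant: accumulate the whole file in a variable,
# then test every string in the END section. File names are invented.
cat > strings.txt <<'EOF'
string1
string2
EOF
cat > data.txt <<'EOF'
a string1 b
c string2
EOF

awk '
NR==FNR { strings[$0]; next }   # first file: collect the search strings
{ buf = buf $0 "\n" }           # second file: append one line at a time
END {
    for (s in strings)
        if ( !index(buf,s) )    # any missing string => non-zero exit
            exit 1
}
' strings.txt data.txt && echo "all present"
```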

If your file_to_be_searched is too large to fit in memory then it'd be this for strings:

awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched
Ed Morton
  • This is a good answer. Started a bounty to attract even more attention. – codeforester Apr 12 '18 at 22:10
  • OK, but you really should clarify whether you want a solution for strings or regexps, and whether you need "full word" matches only or partial matches. – Ed Morton Apr 12 '18 at 22:35
  • Good. Really think long and hard about what each solution is doing, as your posted sample input doesn't cover all of the cases it needs to in order to prove whether a solution works or not. So while all solutions will produce the expected output given that sample input, most of them will fail given different input. So either take your time to figure out different sample input that covers all of your requirements (in particular files that should not match but are hard to handle) or really think through the logic of each solution to make sure it does address **all** of your needs. – Ed Morton Apr 13 '18 at 13:40
  • Oh, and also make sure you understand [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) if you're considering any solution that involves a shell loop, and consider what the given solution will do wrt memory and speed of execution for a huge file. – Ed Morton Apr 13 '18 at 13:41
  • @EdMorton Very nice answer! I am missing only one thing, what if the searched string is split by a newline. Imagine a textbook and the string you want to search for is broken apart by a newline. It is a bit more difficult but it could be a nice addition to this answer! – kvantour Apr 14 '18 at 21:10
  • Thanks, but there's nothing special about a newline, you just treat it like any other character. If the OP wants to search for a string that contains a newline then they just have to set the RS to whatever is the record separator instead of a newline (e.g. `\0` or `^$` as I show above to search the whole file or maybe `""` if blank lines separate the records or `[[:punct:]]` or something else) and include a newline in the string that's being searched with, just like any other character. – Ed Morton Apr 15 '18 at 06:10

git grep

Here is the syntax using git grep with multiple patterns:

git grep --all-match --no-index -l -e string1 -e string2 -e string3 file

You may also combine patterns with Boolean expressions such as --and, --or and --not.

Check man git-grep for help.


--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.

--no-index Search files in the current directory that is not managed by Git.

-l/--files-with-matches/--name-only Show only the names of files.

-e The next parameter is the pattern. Default is to use basic regexp.

Other params to consider:

--threads Number of grep worker threads to use.

-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.

To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and others.
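For many strings, the pattern-file variant might look like this sketch (the file names patterns.txt and haystack.txt are invented; it assumes git is installed):

```shell
# Hypothetical sketch: fixed strings read from a pattern file.
# patterns.txt and haystack.txt are invented names for this demo.
printf 'string1\nstring2\n' > patterns.txt
printf 'x string2\nstring1 y\n' > haystack.txt

# -q: exit status only; --no-index: search plain files, not a git index
git grep -q --all-match --no-index -F -f patterns.txt haystack.txt \
    && echo "all strings present"
```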

kenorb
  • This reports multiple matches per file. But with `time git grep --name-only --all-match -e ...` it only reports the matching files, and it is impressively fast - my solution took about 0.2s for my testcase, yours took 0.02s. And the option `-f patternfile` might be interesting too. Most important maybe: it is simple, clear and compact, and will surely get more support over time than my homegrown script. – user unknown Apr 14 '18 at 23:54
  • And this is why SO is so great. Everybody has different ways of approaching a problem and from time to time something unexpected pops up. – kvantour Apr 16 '18 at 16:42
  • Best I can tell `git grep -q --all-match --no-index -l -F -f strings.txt file` seems like it'd work just fine as long as you don't mind installing git tools. Does it read the whole file into memory at one time or does it store 1 line at a time in memory? Is there any limit on how many strings can be searched for at one time? – Ed Morton Apr 17 '18 at 13:38
  • @EdMorton There are no string limits, it'll just take longer. In terms of memory management, I'm not sure, you can check the [source code of `grep.c`](https://github.com/git/git/blob/master/grep.c) (look for `all_match`). But overall, it's designed to be quick. – kenorb Apr 17 '18 at 13:46
  • Thanks, best I can tell from reading that source file it always reads the whole input file into memory. – Ed Morton Apr 17 '18 at 14:06
  • This is an excellent answer. Thank you to share! However, I found a strange issue. If `` has one or more absolute paths, git will crash with an error like: `BUG: environment.c:202: git environment hasn't been setup` However, relative paths do not cause this issue. I am using `git grep` to search multiple absolute paths with multiple logic-and regexp. Can this be done? – kevinarpe Dec 24 '21 at 13:28

This gnu-awk script may work:

cat fileSearch.awk
re == "" {
   exit
}
{
   split($0, null, "\\<(" re "\\>)", b)
   for (i=1; i<=length(b); i++)
      gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
   exit (re != "")
}

Then use it as:

if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
   echo "all strings were found"
else
   echo "all strings were not found"
fi

Alternatively, you can use this GNU grep solution with the PCRE option:

grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file
  • Using -z, we make grep read the complete file as a single string.
  • We use multiple lookahead assertions to assert that all the strings are present in the file.
  • The regex must use (?s) (the DOTALL modifier) to make .* match across lines.

As per man grep:

-z, --null-data
   Treat  input  and  output  data as sequences of lines, each terminated by a 
   zero byte (the ASCII NUL character) instead of a newline.
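To avoid typing one lookahead per string, the pattern could also be assembled in a loop; a minimal sketch (the string list and file name are invented; assumes GNU grep built with PCRE support):

```shell
# Hypothetical sketch: assemble one lookahead per search string,
# then test the whole file as a single record with -z.
pat='(?s)'
for s in string1 string2 string3; do    # invented string list
    pat="$pat(?=.*\\b$s\\b)"
done

printf 'x string2\nstring1\nstring3 y\n' > data.txt   # invented sample
grep -qzP "$pat" data.txt && echo "all found"
```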
anubhava
  • No `grep` solution works irrespective of any order of appearance of those strings in the file. Lookahead assertions just make sure those strings are present anywhere in the file. So `grep` can also be written as: `grep -qzP '(?s)(?=.*\bstring3\b)(?=.*\bstring1\b)(?=.*\bstring2\b)' file` – anubhava Apr 12 '18 at 14:13
  • Indeed, your `gnu-awk` solution is neat – anubhava Apr 12 '18 at 14:25

First, you probably want to use awk. Since you eliminated that option in the question statement: yes, it is possible, and this provides a way to do it. It is likely MUCH slower than using awk, but if you want to do it anyway...

This is based on the following assumptions:

  • Invoking AWK is unacceptable
  • Invoking grep multiple times is unacceptable
  • The use of any other external tools is unacceptable
  • Invoking grep less than once is acceptable
  • It must return success if everything is found, failure when not
  • Using bash instead of external tools is acceptable
  • bash version is >= 3 for the regular expression version

This might meet all of your requirements (the regex version is missing some comments; look at the string version instead):

#!/bin/bash

multimatch() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    strings=( "$@" ) # search strings into an array

    declare -a matches # Array to keep track which strings already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#strings[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#strings[@]};i++)); do # Loop through strings indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
                string="${strings[$i]}" # fetch the string
                if [[ $line = *$string* ]]; then # check if it matches
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

multimatch_regex() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    regexes=( "$@" ) # Regexes into an array

    declare -a matches # Array to keep track which regexes already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#regexes[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#regexes[@]};i++)); do # Loop through regex indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this regex yet
                regex="${regexes[$i]}" # Get regex from array
                if [[ $line =~ $regex ]]; then # We use the bash regex operator here
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

if multimatch "filename" string1 string2 string3; then
    echo "file has all strings"
else
    echo "file miss one or more strings"
fi

if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
    echo "file match all regular expressions"
else
    echo "file does not match all regular expressions"
fi

Benchmarks

I did some benchmarking, searching .c, .h and .sh files in arch/arm/ from Linux 4.16.2 for the strings "void", "function", and "#define". (Shell wrappers were added and the code tuned so that each can be called as testname <filename> <searchstring> [...] and an if can be used to check the result.)

Results: (measured with time, real time rounded to the nearest half second)

(Invoking grep multiple times, especially with the recursive method, did better than I expected)

Gert van den Berg
  • Some benchmarking (say on the scala file example) would be interesting... It is likely significantly slower than AWK - it is mainly an exercise to show that the stated requirements can be met... (Speed was not mentioned as a requirement) (This also seems to use no external processes - that can be good for speed, but bash text processing, which can be bad for speed, at least compared to C code in grep...) – Gert van den Berg Apr 13 '18 at 16:25
  • Yes, you're right, it would be several orders of magnitude slower than an awk script. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). You use the word `pattern` in several places in your script - that's a highly ambiguous term which as such should generally be avoided so please clarify each time you use it if you are talking about a `string`, a `regexp`, a `globbing pattern` or something else. – Ed Morton Apr 15 '18 at 06:20
  • @EdMorton: It seems to be the only way to avoid standard tools (like AWK) (or non-standard tools, like Perl) (I'm working on the assumption that the only acceptable non-shell tool is `grep`, invoked only once...) (The question asks whether it is "possible", not whether it is a good idea ;-) ) "Pattern" was used for the search string (which can either be a regex or a plain string matched with globbing, depending on the variant) (The regex version was done once I realised that bash had built-in regexes, and is a simple modification of the first version) (detailed comments was added after fork) – Gert van den Berg Apr 16 '18 at 07:31
  • @EdMorton: A lot (but not all) of the criticism of the read method is invoking external tools multiple times, which this avoids. (It does need a blank IFS to keep it from stripping leading whitespace (I know that with a single variable, whitespace between terms are not affected by `$IFS`)) – Gert van den Berg Apr 16 '18 at 07:37
  • Thanks for doing the benchmarking. Since the OP is searching for thousands of strings could you try again with a large number of strings (e.g. at least 1,000, all of which appear in the target file,and some of which are subsets of each other and some of which contain regexp metachars)? There's huge differences between how the various solutions will perform as the number of strings to be searched for gets large (and match) plus some solutions will fail given strings that are substrings of other strings or strings that contain RE chars and those differences won't show up for just those 3 strings – Ed Morton Apr 17 '18 at 13:57
  • @EdMorton: It becomes tricky - many of the solutions have different interfaces that becomes tricky (and possibly slow) to map. The CLI ones might need a non-shell method to invoke them to build a `argv` close to `ARG_MAX` (Although they can be arbitrarily combined if they have proper exit codes with `&&` operators, with the disadvantage of multiple scans of the file if the first part matches) – Gert van den Berg Apr 17 '18 at 14:30
  • (The output also needs to be checked) (Those strings (and extensions) on the test set was mainly chosen that some files relatively quickly match all strings (to disadvantage things that scan entire files multiple times irrespective of matches) and have some files unlikely to match all strings (The `.sh` ones) - for performance when the whole file needs to be scanned) (I started with the entire source but ran out of patience waiting for my shell ones) – Gert van den Berg Apr 17 '18 at 14:40

You can

  • make use of the -o|--only-matching option of grep (which forces to output only the matched parts of a matching line, with each such part on a separate output line),

  • then eliminate duplicate occurrences of matched strings with sort -u,

  • and finally check that the count of remaining lines equals the count of the input strings.

Demonstration:

$ cat input 
...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

$ grep -o -F $'string1\nstring2\nstring3' input|sort -u|wc -l
3

$ grep -o -F $'string1\nstring3' input|sort -u|wc -l
2

$ grep -o -F $'string1\nstring2\nfoo' input|sort -u|wc -l
2

One shortcoming of this solution (failing to meet the "partial matches should be OK" requirement) is that grep doesn't detect overlapping matches. For example, although the text abcd matches both abc and bcd, grep finds only one of them:

$ grep -o -F $'abc\nbcd' <<< abcd
abc

$ grep -o -F $'bcd\nabc' <<< abcd
abc

Note that this approach/solution works only for fixed strings. It cannot be extended to regexes, because a single regex can match multiple different strings and we cannot track which match corresponds to which regex. The best you can do is store the matches in a temporary file and then run grep multiple times, using one regex at a time.
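That regex fallback might look like this minimal sketch (file names and contents are invented): one quiet grep per regex, stopping at the first one that fails, which is exactly the repeated invocation the question hoped to avoid:

```shell
# Hypothetical sketch of the regex fallback: one grep -Eq per regex.
# regexes.txt and input.txt are invented names.
cat > regexes.txt <<'EOF'
str[0-9]
^string2
EOF
printf 'string2 here\nstr7\n' > input.txt

all_matched=1
while IFS= read -r re; do
    grep -Eq -e "$re" input.txt || { all_matched=0; break; }
done < regexes.txt

[ "$all_matched" -eq 1 ] && echo "every regex matched"
```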


The solution implemented as a bash script:

matchall:

#!/usr/bin/env bash

if [ $# -lt 2 ]
then
    echo "Usage: $(basename "$0") input_file string1 [string2 ...]"
    exit 1
fi

function find_all_matches()
(
    infile="$1"
    shift

    IFS=$'\n'
    newline_separated_list_of_strings="$*"
    grep -o -F "$newline_separated_list_of_strings" "$infile"
)

string_count=$(($# - 1))
matched_string_count=$(find_all_matches "$@"|sort -u|wc -l)

if [ "$matched_string_count" -eq "$string_count" ]
then
    echo "ALL strings matched"
    exit 0
else
    echo "Some strings DID NOT match"
    exit 1
fi

Demonstration:

$ ./matchall
Usage: matchall input_file string1 [string2 ...]

$ ./matchall input string1 string2 string3
ALL strings matched

$ ./matchall input string1 string2
ALL strings matched

$ ./matchall input string1 string2 foo
Some strings DID NOT match
Leon

The easiest way for me to check whether the file has all three patterns is to get only the matched patterns, output only the unique parts, and count the lines. Then you can check it with a simple test condition: test 3 -eq "$grep_lines".

 grep_lines=$(grep -Eo 'string1|string2|string3' file | sort -u | wc -l)

Regarding your second question, I don't think it's possible to stop reading the file as soon as all of the patterns have been found. I've read the man page for grep and there are no options that could help you with that. You can only stop reading after a specific number of matching lines with grep -m [number], which happens regardless of which patterns matched.

I'm pretty sure a custom function is needed for that purpose.
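Putting the pieces above together, a minimal sketch of the whole check (the sample file contents are invented; sort -u is used so that non-adjacent duplicate matches are also collapsed):

```shell
# Sketch of the complete check; the contents of 'file' are invented.
printf 'string1 x\nstring3\nstring2\n' > file

grep_lines=$(grep -Eo 'string1|string2|string3' file | sort -u | wc -l)
if test 3 -eq "$grep_lines"; then
    echo "all three found"
fi
```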

Anna

A recursive solution. Iterate over the files one by one. For each file, check whether it matches the first pattern, stopping at the first match (-m1); only if it matched the first pattern, search for the second pattern, and so on:

#!/bin/bash

patterns="$@"

fileMatchesAllNames () {
  file=$1
  if [[ $# -eq 1 ]]
  then
    echo "$file"
  else
    shift
    pattern=$1
    shift
    grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
  fi
}

for file in *
do
  test -f "$file" && fileMatchesAllNames "$file" $patterns
done

Usage:

./allfilter.sh cat filter java
test.sh

Searches the current dir for the tokens "cat", "filter" and "java". They were found only in "test.sh".

So in the worst case (the first N-1 patterns are found only in the last line of each file, and the N-th pattern is missing), grep is invoked many times.

But with an informed ordering (rare matches first, early matches first) where possible, the solution should be reasonably fast, since many files are abandoned early because they didn't match the first keyword, or accepted early because they matched a keyword close to the top.

Example: Say you search a Scala source file which contains tailrec (somewhat rarely used), mutable (rarely used, but if so, close to the top in import statements), main (rarely used, often not close to the top) and println (often used, unpredictable position); you would order them:

./allfilter.sh mutable tailrec main println 

Performance:

ls *.scala | wc 
 89      89    2030

In the 89 scala files, the keyword distribution is:

for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done 
16
34
41
71

Searching them with a slightly modified version of the script, which allows using a file pattern as its first argument, takes about 0.2s:

time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala    Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala

real    0m0.216s
user    0m0.024s
sys 0m0.028s

in close to 15,000 lines of code:

cat *.scala | wc 
  14913   81614  610893

Update:

After reading in the comments to the question that we might be talking about thousands of patterns, handing them over as arguments doesn't seem to be a clever idea; better to read them from a file and pass the filename as an argument - maybe for the list of files to filter, too:

#!/bin/bash

filelist="$1"
patternfile="$2"
patterns="$(< $patternfile)"

fileMatchesAllNames () {
  file=$1
  if [[ $# -eq 1 ]]
  then
    echo "$file"
  else
    shift
    pattern=$1
    shift
    grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
  fi
}

echo -e "Filelist: $filelist\tPatterns: $patterns"
for file in $(< $filelist)
do
  test -f "$file" && fileMatchesAllNames "$file" $patterns
done

If the number and length of patterns/files exceeds the possibilities of argument passing, the list of patterns could be split into many patternfiles and processed in a loop (for example of 20 pattern files):

for i in {1..20}
do
   ./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done
user unknown
  • OP is looking for a solution that doesn't require invoking grep multiple times (per file). – Leon Apr 13 '18 at 10:57
  • @Leon: That's a question of whether it is possible. And at least for all files which don't match a former pattern, grep is not invoked any longer, and on each match it stops early, so with an informed ordering (rare matches first, early matches first) where possible, the solution should be reasonably fast. – user unknown Apr 13 '18 at 11:02

It's an interesting problem, and there's nothing obvious in the grep man page to suggest an easy answer. There might be an insane regex that would do it, but it may be clearer with a straightforward chain of greps, even though that ends up scanning the file n times. At least the -q option makes it bail at the first match each time, and the && will short-circuit the evaluation if one of the strings is not found.

$ grep -Fq string1 t && grep -Fq string2 t && grep -Fq string3 t
$ echo $?
0

$ grep -Fq string1 t && grep -Fq blah t && grep -Fq string3 t
$ echo $?
1
Ian McGowan
  • This would work, however, OP says he doesn't want to call `grep` multiple times. – learningbee Apr 11 '18 at 19:52
  • Acknowledged, but I don't think there's another way to do it with plain grep, and at least the -q and && will shortcut some of the multiple passes over the data files. – Ian McGowan Apr 12 '18 at 01:53

Perhaps with GNU sed:

cat match_word.sh

sed -z '
  /\b'"$2"'/!bA
  /\b'"$3"'/!bA
  /\b'"$4"'/!bA
  /\b'"$5"'/!bA
  s/.*/0\n/
  q
  :A
  s/.*/1\n/
' "$1"

and you call it like this:

./match_word.sh infile string1 string2 string3

It outputs 0 if all matches are found, else 1.

Here you can look for up to 4 strings.

If you want more, you can add lines like

/\b'"$x"'/!bA
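Instead of hardcoding one line per positional parameter, the test lines could also be generated; a hypothetical sketch for an arbitrary word list (GNU sed assumed, as above; the sample file and word list are invented):

```shell
# Hypothetical sketch: build the sed program for any number of words.
printf 'x alpha y\nbeta\n' > infile   # invented sample input

prog=''
for w in alpha beta; do               # invented word list
    prog="$prog/\\b$w/!bA
"
done
prog="$prog"'s/.*/0\n/
q
:A
s/.*/1\n/'

sed -z "$prog" infile                 # prints 0 if all words match, else 1
```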
ctac_
  • That script will not look for any strings, it will look for 4 regexps. If you want it to act as if it were looking for strings then see https://stackoverflow.com/q/29613304/1745001 for how to do that. – Ed Morton Apr 15 '18 at 06:17

Just for "solution completeness", you can use a different tool and avoid multiple greps and awk/sed or big (and probably slow) shell loops; such a tool is agrep.

agrep is actually a kind of egrep that also supports an and operation between patterns, using ; as the pattern separator.

Like egrep, and like most of the well-known tools, agrep operates on records/lines, so we still need a way to treat the whole file as a single record.
Moreover, agrep provides a -d option to set a custom record delimiter.

Some tests:

$ cat file6
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3

$ agrep -d '$$\n' 'str3;str2;str1;str4' file6;echo $?
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3
0

$ agrep -d '$$\n' 'str3;str2;str1;str4;str5' file6;echo $?
1

$ agrep -p 'str3;str2;str1' file6  #-p prints lines containing all three patterns in any position
str1 str2 str3
str3 str1 str2

No tool is perfect, and agrep also has some limitations; you can't use a regex/pattern longer than 32 chars, and some options are not available when used with regexps; all these are explained in the agrep man page.

George Vasiliou

Ignoring the "Is it possible to do it without ... or use a tool like awk or python?" requirement, you can do it with a Perl script:

(Use an appropriate shebang for your system, or something like #!/usr/bin/env perl)

#!/usr/bin/perl

use Getopt::Std; # option parsing

my %opts;
my $filename;
my @patterns;
getopts('rf:',\%opts); # Allowing -f <filename> and -r to enable regex processing

if ($opts{'f'}) { # if -f is given
    $filename = $opts{'f'};
    @patterns = @ARGV[0 .. $#ARGV]; # Use everything else as patterns
} else { # Otherwise
    $filename = $ARGV[0]; # First parameter is filename
    @patterns = @ARGV[1 .. $#ARGV]; # Rest is patterns
}
my $use_re= $opts{'r'}; # Flag on whether patterns are regex or not

open(INF,'<',$filename) or die("Can't open input file '$filename'");


while (my $line = <INF>) {
    my @removal_list = (); # List of stuff that matched that we don't want to check again
    for (my $i=0;$i <= $#patterns;$i++) {
        my $pattern = $patterns[$i];
        if (($use_re && $line =~ /$pattern/) || # regex match
            (!$use_re && index($line,$pattern) >= 0)) { # or string search
            push(@removal_list,$i); # Mark to be removed
        }
    }
    # Now remove everything we found this time
    # We need to work backwards to keep us from messing
    # with the list while we're busy
    for (my $i=$#removal_list;$i >= 0;$i--) {
        splice(@patterns,$removal_list[$i],1);
    }
    if (scalar(@patterns) == 0) { # If we don't need to match anything anymore
        close(INF) or warn("Error closing '$filename'");
        exit(0); # We found everything
    }
}
# End of file

close(INF) or die("Error closing '$filename'");
exit(1); # If we reach this, we haven't matched everything

If saved as matcher.pl, this will search for plain text strings:

./matcher filename string1 string2 string3 'complex string'

This will search for regular expressions:

./matcher -r filename regex1 'regex2' 'regex4'

(The filename can be given with -f instead):

./matcher -f filename -r string1 string2 string3 'complex string'

It is limited to single line matching patterns (due to dealing with the file linewise).

The performance, when called for lots of files from a shell script, is slower than awk (but search patterns can contain spaces, unlike the space-separated ones passed via -v to awk). If converted to a function and called from Perl code (with a file containing a list of files to search), it should be much faster than most awk implementations. (When called on several smallish files, Perl's startup time (parsing the script, etc.) dominates the timing.)

It can be sped up significantly by hardcoding whether regular expressions are used or not, at the cost of flexibility. (See my benchmarks here to see what effect removing Getopt::Std has)

Gert van den Berg
  • 2,448
  • 31
  • 41
1

Assuming all your strings to check are in a file strings.txt, and the file you want to check in is input.txt, the following one-liner will do:

Updated the answer based on comments:

$ diff <( sort -u strings.txt )  <( grep -o -f strings.txt input.txt | sort -u )

Explanation:

Use grep's -o option to match only the strings you are interested in. This gives all the strings that are present in the file input.txt. Then use diff to get the strings that are not found. If all the strings were found, the result would be nothing. Or, just check the exit code of diff.
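For example, a script can branch on diff's exit status, which is 0 only when the two sorted lists are identical (this uses bash process substitution; the sample data below is just a placeholder):

```shell
# Placeholder sample data, using the same file names as above
printf 'string1\nstring2\n' > strings.txt
printf 'x string1 y\nz string2\n' > input.txt

# diff exits 0 only when every string in strings.txt was found in input.txt
if diff <( sort -u strings.txt ) <( grep -o -f strings.txt input.txt | sort -u ) > /dev/null; then
    echo "all strings found"
else
    echo "some strings missing"
fi
```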

What it does not do:

  • Exit as soon as all matches are found.
  • Extend to regexes.
  • Handle overlapping matches.

What it does do:

  • Find all matches.
  • Single call to grep.
  • Does not use awk or python.
Gautam
  • 1,862
  • 9
  • 16
  • 1
    Instead of using `awk` to create the expression for `grep`, we could definitely use `grep -f` which would work much better when we have a lot of strings to search for. That way, it would be easier to handle any special characters in the strings. – codeforester Apr 16 '18 at 20:05
  • 1
    Ignoring the UUOCs, this searches for regexps, not strings. You're using awk to create a regexp like `regexp1|regexp2|...` that gets saved in the badly-named variable `STR` and then you're using `grep -o` to find it. The grep `-o` MUST do a regexp match since you're relying on the `|`s in STR so every part of `STR` is treated as a regexp and there's nothing you can do to fix that. It will also fail in several ways, e.g. when one "string" is part of another, e.g. try `$ echo 'footherebar' | grep -E -o 'the|there'` and you'll find the output is just `there`, not `the` and `there` as required. – Ed Morton Apr 16 '18 at 20:47
  • The new version, can probably deal with plain strings (by adding a `-F` to the `grep`) (and without `-F` or with `-G`, standard regexes) (And can deal with extended regexes with `-E` or other non-standard variants, like `-P` for PCRE regexes in GNU grep) – Gert van den Berg Apr 17 '18 at 13:14
  • 1
    It still can't handle strings being contained in other strings though. Again, try any variation of `echo 'footherebar' | grep -E -o 'the|there'` and it'll find `there` but not `the`. – Ed Morton Apr 17 '18 at 13:34
  • Yes, Strangely, if the contents of my strings.txt is "there\nthe" , it matches both the strings, whereas if it is "the\nthere" it matches only one. :P . I have modified my answer to say it isn't able to though. – Gautam Apr 17 '18 at 14:10
  • Probably `cmp` is better than `diff` in this context. – codeforester Apr 19 '18 at 17:33
1
perl -lne '%m = (%m, map {$_ => 1} m!\b(string1|string2|string3)\b!g); END { print scalar keys %m == 3 ? "Match": "No Match"}' file
binish
  • 109
  • 1
  • 6
  • 1
    Is that **really** looking for strings or is it looking for regexps? I mean if string1 was `".*"` would it **only** match against the string `".*"` in the input or would it match against any sequence of characters? – Ed Morton Apr 18 '18 at 12:01
  • Its technically regexps, but in this context it is looking for strings. If there are any metachars, you need to escape appropriately. – binish Apr 18 '18 at 16:28
  • 1
    Doesn't perl have some option to do the escaping for you (like shells `printf '%q'`) so your regexps kinda-sorta get string-like treatment? I thought I saw that somewhere. In any case, its very misleading to use the name "string" for your regexps. Seems to be a common theme in this thread though, idk why.... – Ed Morton Apr 18 '18 at 17:04
  • One of the downsides here is that it reads the entire file always and doesn't stop as soon as all strings match. – codeforester Apr 19 '18 at 17:22
0

In Python, using the fileinput module allows the files to be specified on the command line, or the text to be read line by line from stdin. You could hard-code the strings into a Python list.

# Strings to match, must be valid regular expression patterns
# or be escaped when compiled into regex below.
strings = (
    r'string1',
    r'string2',
    r'string3',
)

or read the strings from another file

import re
from fileinput import input, filename, nextfile, isfirstline

for line in input():
    if isfirstline():
        regexs = [re.compile(s) for s in strings] # new file, reload all strings

    # keep only patterns that have not been seen in this file
    # (search, not match, so a pattern may occur anywhere in the line)
    regexs = [rx for rx in regexs if not rx.search(line)]

    if not regexs: # found all strings
        print(filename())
        nextfile()
Mike Robins
  • 1,733
  • 10
  • 14
0

Many of these answers are fine as far as they go.

But if performance is an issue -- certainly possible if the input is large and you have many thousands of patterns -- then you'll get a large speedup using a tool like lex or flex that generates a true deterministic finite automaton as a recognizer rather than calling a regex interpreter once per pattern.

The finite automaton will execute a few machine instructions per input character regardless of the number of patterns.

A no-frills flex solution:

%{
void match(int);
%}
%option noyywrap

%%

"abc"       match(0);
"ABC"       match(1);
[0-9]+      match(2);
/* Continue adding regex and exact string patterns... */

[ \t\n]     /* Do nothing with whitespace. */
.   /* Do nothing with unknown characters. */

%%

// Total number of patterns.
#define N_PATTERNS 3

int n_matches = 0;
int counts[10000];

void match(int n) {
  if (counts[n]++ == 0 && ++n_matches == N_PATTERNS) {
    printf("All matched!\n");
    exit(0);
  }
}

int main(void) {
  yyin = stdin;
  yylex();
  printf("Only matched %d patterns.\n", n_matches);
  return 1;
}

A down side is that you'd have to build this for every given set of patterns. That's not too bad:

flex matcher.l
gcc -O lex.yy.c -o matcher

Now run it:

./matcher < input.txt
Gene
  • 46,253
  • 4
  • 58
  • 96
  • 1
    I bet you don't get a performance improvement over [the awk solution](https://stackoverflow.com/a/49786059/1745001) since awk is highly optimized for text processing, usually out-performing compiled languages like "C" for that task. Remember we're looking for strings btw, not regexps. – Ed Morton Apr 18 '18 at 11:59
  • @EdMorton: For non-regexes, a C version based on [this](https://stackoverflow.com/a/8584708/1837991) on `mmap`ed files runs my benchmark in around 1.5 seconds... (Using `argv` for search strings, not a file) (Extending the simple C case to regexes is not that easy though....) – Gert van den Berg Apr 18 '18 at 15:42
  • 1
    Again, it's important to be searching for 1,000+ strings like the OP says is realistic. Searching for, say, 3 strings really isn't very useful wrt comparative timing. – Ed Morton Apr 18 '18 at 15:56
  • @EdMorton Awk is a great tool, but it uses a regex interpreter. On 5,000 patterns, it will proceed by attempting to match each of the 5,000 patterns in sequence for each input character. Flex will compile all 5000 patterns into a single DFA that executes a few instructions per input character. That's why compiler scanners - where performance impacts every program ever compiled for the life of the scanner - are implemented with DFAs, not regex engines. – Gene Apr 18 '18 at 21:57
  • We are not working with regexps in this question we are working with strings. Awks regexp engine is not being used and even if it were it would not be working with a single regexp and the algorithm used would reduce the number of regexps each time one is found. – Ed Morton Apr 18 '18 at 23:14
  • @EdMorton Actually the question says "we could have regexes in place of strings." Whether the patterns are simple strings or use alternation and Kleene star, my point is the same. For each character, an interpreter will - for each character - try each pattern in sequence until it finds a match. A DFA will do the equivalent of a C `switch` branch on each character and then move on to the next one. – Gene Apr 19 '18 at 00:01
  • I took that as meaning the strings could contain regexp metacharacters but they should be treated as literal strings regardless. That's fine but that's not what a reasonable awk solution to the problem of finding multiple strings in a file will do, it'll simply loop through an array of those strings and remove each one that's found in the input file until the array is empty or the end of file is reached. If you think your approach will be faster then just try it and let us know the results. – Ed Morton Apr 19 '18 at 00:30
0

For plain speed, with no external tool limitations, and no regexes, this (crude) C version does a decent job. (Possibly Linux only, although it should work on all Unix-like systems with mmap)

#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* https://stackoverflow.com/a/8584708/1837991 */
static inline char *sstrstr(char *haystack, char *needle, size_t length)
{
    size_t needle_length = strlen(needle);
    size_t i;
    for (i = 0; i < length; i++) {
        if (i + needle_length > length) {
            return NULL;
        }
        if (strncmp(&haystack[i], needle, needle_length) == 0) {
            return &haystack[i];
        }
    }
    return NULL;
}

int matcher(char * filename, char ** strings, unsigned int str_count)
{
    int fd;
    struct stat sb;
    char *addr;
    unsigned int i = 0; /* Used to keep us from running off the end of strings into SIGSEGV */

    fd = open(filename, O_RDONLY);
    if (fd == -1) {
        fprintf(stderr,"Error '%s' with open on '%s'\n",strerror(errno),filename);
        return 2;
    }

    if (fstat(fd, &sb) == -1) {          /* To obtain file size */
        fprintf(stderr,"Error '%s' with fstat on '%s'\n",strerror(errno),filename);
        close(fd);
        return 2;
    }

    if (sb.st_size <= 0) { /* zero byte file */
        close(fd);
        return 1; /* 0 byte files don't match anything */
    }

    /* mmap the file. */
    addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        fprintf(stderr,"Error '%s' with mmap on '%s'\n",strerror(errno),filename);
        close(fd);
        return 2;
    }

    while (i++ < str_count) {
        char * found = sstrstr(addr,strings[0],sb.st_size);
        if (found == NULL) {  /* If we haven't found this string, we can't find all of them */
            munmap(addr, sb.st_size);
            close(fd);
            return 1; /* so give the user an error */
        }
        strings++;
    }
    munmap(addr, sb.st_size);
    close(fd);
    return 0; /* if we get here, we found everything */
}

int main(int argc, char *argv[])
{
    char *filename;
    char **strings;
    unsigned int str_count;
    if (argc < 3) { /* Lets count parameters at least... */
        fprintf(stderr,"%i is not enough parameters!\n",argc);
        return 2;
    }
    filename = argv[1]; /* First parameter is filename */
    strings = argv + 2; /* Search strings start from 3rd parameter */
    str_count = argc - 2; /* strings are two ($0 and filename) less than argc */

    return matcher(filename,strings,str_count);
}

Compile it with:

gcc matcher.c -o matcher

Run it with:

./matcher filename needle1 needle2 needle3

Notes:

  • It will scan through the parts of the file preceding the matched strings multiple times - it will only open the file once though.
  • The entire file might end up loaded into memory, especially if a string doesn't match; the OS needs to decide that
  • regex support can probably be added by using the POSIX regex library (performance would likely be slightly better than grep - it should be based on the same library, and you would gain reduced overhead from only opening the file once when searching for multiple regexes)
  • Files containing NUL bytes should work; search strings containing them will not, though...
  • All characters other than null should be searchable (\r, \n, etc)
greybeard
  • 2,249
  • 8
  • 30
  • 66
Gert van den Berg
  • 2,448
  • 31
  • 41
0

The following python script should do the trick. It calls the equivalent of grep (re.search) multiple times for each line -- i.e. it searches for each pattern on each line -- but since you are not forking a process each time, it should be much more efficient. Also, it removes the patterns which have already been found and stops when all of them have been found.

#!/usr/bin/env python

import re

# the file to search
filename = '/path/to/your/file.txt'

# list of patterns -- can be read from a file or command line 
# depending on the count
patterns = [r'py.*$', r'\s+open\s+', r'^import\s+']
patterns = [re.compile(p) for p in patterns]

with open(filename) as f:
    for line in f:
        # keep only the patterns that did not match this line
        patterns = [p for p in patterns if p.search(line) is None]

        # stop if no more patterns are left
        if not patterns:
            break

# print the patterns which were not found
for p in patterns:
    print(p.pattern)

If you are dealing with plain (non-regex) strings, you can use a separate substring check (string in line) instead -- it will be slightly more efficient.

Does that solve your problem?

Satyen Rai
  • 1,493
  • 1
  • 12
  • 19
0

One more Perl variant - as soon as all the given strings have matched, even if the file has only been read halfway through, processing stops and the result is printed

> perl -lne ' /\b(string1|string2|string3)\b/ and $m{$1}++; eof if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}'  all_match.txt
Match
> perl -lne ' /\b(string1|string2|stringx)\b/ and $m{$1}++; eof if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}'  all_match.txt
No Match
stack0114106
  • 8,534
  • 3
  • 13
  • 38
0

First delete the line separators, then use normal grep multiple times, once per pattern, as below.

Example: Let the file content be as below

PAT1
PAT2
PAT3
something
somethingelse

cat file | tr -d "\n" | grep "PAT1" | grep "PAT2" | grep -c "PAT3"
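The same idea can be written without cat and with a quiet final grep, so the pipeline's exit status can drive a conditional. A sketch (note the caveat that deleting newlines also lets a pattern match across what used to be a line boundary):

```shell
# Sample file as in the example above
printf 'PAT1\nPAT2\nPAT3\nsomething\nsomethingelse\n' > file

# Each grep narrows the (single-line) stream; -q on the last one
# suppresses output and just yields an exit status.
if tr -d '\n' < file | grep 'PAT1' | grep 'PAT2' | grep -q 'PAT3'; then
    echo "all patterns present"
fi
```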
Sun
  • 1,505
  • 17
  • 25
-1

I didn't see a simple counter among the answers, so here is a counter-oriented solution using awk that stops as soon as all matches are satisfied:

/string1/ { a = 1 }
/string2/ { b = 1 }
/string3/ { c = 1 }
{
    if (c + a + b == 3) {
        print "Found!";
        exit;
    }
}
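Saved to a file, the script above can be invoked like this (file names here are illustrative):

```shell
# Save the counter script (same content as above)
cat > check.awk <<'EOF'
/string1/ { a = 1 }
/string2/ { b = 1 }
/string3/ { c = 1 }
{
    if (c + a + b == 3) {
        print "Found!";
        exit;
    }
}
EOF

# Sample input containing all three strings
printf 'string1\nstring2\nstring3\n' > sample.txt

# Prints "Found!" and stops reading as soon as all three have matched
awk -f check.awk sample.txt
```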

A generic script

to expand usage through shell arguments:

#! /bin/sh
awk -v vars="$*" -v argc=$# '
BEGIN { split(vars, args); }
{
    for (arg in args) {
        if (!temp[arg] && $0 ~ args[arg]) {
            inc++;
            temp[arg] = 1;
        }
    }

    if (inc == argc) {
        print "Found!";
        exit;
    }
}
END { exit 1; }
' filename

Usage (in which you can pass Regular Expressions):

./script "str1?" "(wo)?men" str3

or to apply a string of patterns:

./script "str1? (wo)?men str3"
revo
  • 47,783
  • 14
  • 74
  • 117
  • The OP tells us she has thousands of strings to search for and so hard-coding them all as in your first script won't be practical (plus it's searching for regexps, not strings). Your second script is similar to [mine](https://stackoverflow.com/a/49786059/1745001) in terms of using a counter but instead of reducing the array of strings on every loop like mine does, yours is creating a duplicate array of strings and so will be slower than mine and use twice as much memory. It wasn't me who downvoted btw. – Ed Morton Apr 17 '18 at 12:04
  • Where am I creating duplicate array of strings? and I don't get you by * it's searching for regexps, not strings* since whatever the input is, a pattern or a literal string, it doesn't confuse. @EdMorton – revo Apr 17 '18 at 12:08
    In your first script `/string/` is defining a regexp literal with the contents `string` and then doing a regexp comparison of that against `$0`, i.e. it's searching for a regexp not a string. In your second script you have an array `args[]` that contains all of your strings and each time you match a string in the input you're adding it to the array `temp[]` so when all strings are found in the input you end up with `temp[]` being a duplicate of `args[]`. – Ed Morton Apr 17 '18 at 12:13
  • Oh, hang on - I misread your second script. I assumed you were using `args[]` to contain the arg strings as indices so you could do a simple loop on the indices and a string comparison like I'm doing but you aren't you're storing the strings as array contents and looping on numeric indices and then dereferencing the array each time and doing a regexp comparison! So `arg` in your second script isn't actually an arg (one of the strings passed in) it's an index and the arg/string is actually at `args[arg]` so you aren't creating a dup in `temp[]` but there's other issues instead. – Ed Morton Apr 17 '18 at 12:21
  • I guessed you did probably on if condition... but let's talk about strings vs regexps. I see it as one single solution for both. You provided two individually and I'm not sure how it acts inefficiently on strings vs regexps. Maybe a benchmark needed? @EdMorton – revo Apr 17 '18 at 12:27
  • Try both solutions when the string being searched for is `.*`. Mine will find the exact string `.*` (i.e.a period followed by an asterisk) if and only if it exists in the input file while yours will match on the first line no matter what the input file contents since `.*` will be treated as the regexp "zero or more repetitions of any character". wrt efficiency, I'm not considering the efficiency of regexp vs string comparisons, I'm removing each string from my array when it matches so the array gets smaller and so faster to loop through on every match. Feel free to do timing tests if you like. – Ed Morton Apr 17 '18 at 12:31
  • Isn't it dependent on OP? in case of `.*` you are right but if OP doesn't have quantifiers or regex meta characters in his / her arguments then what would be the issue? and since a line is read at a time, even `.*` which prints wrong substring, talking performance-wise, doesn't differ from matching a literal string like `s`. Even if whole input string is one single line by the mean of `\n` being the RS. @EdMorton – revo Apr 17 '18 at 12:38
  • Right, if the OP chooses not to have any regexp metacharacters then there are no false matches produced by regexp metacharacters but why impose that restriction when you can just as easily do string comparisons? The OP **said** she wants to find strings so why then say 'ok but you can't simply look for "foo.c" in a list of file names', etc.. Again, when I'm talking about performance I'm not talking about the difference in a string vs regexp comparison, I'm talking about reducing the number of values being looped on for every time a value is found vs keeping the number of iterations constant. – Ed Morton Apr 17 '18 at 12:57
  • To do a search without regexes, you can use `index` (see your local `awk` man page) instead of `=~`. Two variants is one option, another is to have a setting that chooses whether `index` or `=~` is used. – Gert van den Berg Apr 18 '18 at 09:20
  • @GertvandenBerg Right, you can use index() to do a string instead of regexp comparison and you can delete the matched strings from args instead of adding flags to temp to improve performance and then you end up with [my answer](https://stackoverflow.com/a/49786059/1745001) – Ed Morton Apr 18 '18 at 12:04
-1
$ cat allstringsfile | tr '\n' ' ' |  awk -f awkpattern1

Where allstringsfile is your text file, as in the original question, and awkpattern1 contains the string patterns combined with the && condition:

$ cat awkpattern1
/string1/ && /string2/ && /string3/
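To get a usable exit status instead of printed output, the same && test can set a flag. A sketch with inline patterns instead of the separate pattern file (sample data is a placeholder):

```shell
# Placeholder sample file containing all three strings
printf 'string1\nstring2\nstring3\n' > allstringsfile

# tr joins everything into one record; awk exits 0 only if all three matched
if tr '\n' ' ' < allstringsfile | awk '/string1/ && /string2/ && /string3/ { found = 1 } END { exit !found }'; then
    echo "all matched"
fi
```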