1

I'm trying to check if a multi-line string exists in a file using common bash commands (grep, awk, ...).

I want to have a file with a few lines (plain lines, not patterns) that should exist in another file, and to create a command (or sequence of commands) that checks whether they do. If grep could accept arbitrary multiline patterns, I'd do it with something similar to

grep "`cat contentfile`" targetfile

As with grep I'd like to be able to check the exit code from the command. I'm not really interested in the output. Actually no output would be preferred since then I don't have to pipe to /dev/null.

I've searched for hints, but can't come up with a search that gives any good hits. There's "How can I search for a multiline pattern in a file?", but that is about pattern matching.

I've found pcre2grep, but need to use "standard" *nix tools.

Example:

contentfile:

line 3
line 4
line 5

targetfile:

line 1
line 2
line 3
line 4
line 5
line 6

This should match and return 0 since the sequence of lines in the content file is found (in the exact same order) in the target file.

EDIT: Sorry for not being clear about the "pattern" vs. "string" comparison and the "output" vs. "exit code" in the previous versions of this question.

thoni56
  • Are you on Linux? Or do you need MacOS/BSD compatibility? – John1024 Jul 13 '19 at 07:32
  • `perl -0777 -pe 'exit 0 if s/'"$(cat patternfile)"'//; exit 1' targetfile`? – Cyrus Jul 13 '19 at 07:37
  • @Cyrus Works, at least for some simple tests I just made. Turn it into an answer, please. – thoni56 Jul 13 '19 at 07:47
  • It doesn't work if patternfile contains `/`. I'm sure there are even better solutions. – Cyrus Jul 13 '19 at 07:53
  • This might help: [How to know if a text file is a subset of another](https://unix.stackexchange.com/q/114877/74329) – Cyrus Jul 13 '19 at 08:10
  • Possible duplicate of [How can I search for a multiline pattern in a file?](https://stackoverflow.com/questions/152708/how-can-i-search-for-a-multiline-pattern-in-a-file) – tripleee Jul 13 '19 at 09:00
  • @John1024 MacOS/BSD compatibilty is not required, but of course an added benefit. – thoni56 Jul 13 '19 at 21:12

8 Answers

2

You didn't say whether you want a regexp match or a string match, and we can't tell: you named your search file "patternfile", and a "pattern" could mean anything. At one point you imply a string match (check if a multi-line _string_ exists), but then you use grep and pcre2grep with no stated args for string rather than regexp matches.

In any case, these will do whatever it is you want using any awk (which includes POSIX standard awk and you said you wanted to use standard UNIX tools) in any shell on every UNIX box:

For a regexp match:

$ cat tst.awk
NR==FNR { pat = pat $0 ORS; next }
{ tgt = tgt $0 ORS }
END {
    while ( match(tgt,pat) ) {
        printf "%s", substr(tgt,RSTART,RLENGTH)
        tgt = substr(tgt,RSTART+RLENGTH)
    }
}

$ awk -f tst.awk patternfile targetfile
line 3
line 4
line 5

For a string match:

$ cat tst.awk
NR==FNR { pat = pat $0 ORS; next }
{ tgt = tgt $0 ORS }
END {
    lgth = length(pat)
    while ( beg = index(tgt,pat) ) {
        printf "%s", substr(tgt,beg,lgth)
        tgt = substr(tgt,beg+lgth)
    }
}

$ awk -f tst.awk patternfile targetfile
line 3
line 4
line 5

Having said that, with GNU awk you could do the following if you're OK with a regexp match and backslash interpretation of the patternfile contents (so \t is treated as a literal tab):

$ awk -v RS="$(cat patternfile)" 'RT!=""{print RT}' targetfile
line 3
line 4
line 5

or with GNU grep:

$ grep -zo "$(cat patternfile)" targetfile | tr '\0' '\n'
line 3
line 4
line 5

There are many other options depending on what kind of match you're really trying to do and which tools versions you have available.
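
Since only the exit status is wanted, a minimal sketch of the string-match version above, reduced to a silent check, might look like the following. The function name contains_block is made up for illustration, and prefixing both strings with ORS (so that a match can only start at the beginning of a line) is an added assumption about the desired behaviour, not part of the scripts above.

```shell
# contains_block CONTENTFILE TARGETFILE   (hypothetical helper name)
# Exits 0 if the lines of CONTENTFILE occur, in order and contiguously,
# as whole lines in TARGETFILE; exits 1 otherwise. Prints nothing.
contains_block() {
    awk '
        NR==FNR { pat = pat $0 ORS; next }   # first file: build the literal search string
        { tgt = tgt $0 ORS }                 # second file: build the text to search in
        END {
            # The leading ORS anchors the match at the start of a line (assumption).
            exit !index(ORS tgt, ORS pat)
        }
    ' "$1" "$2"
}
```

Called as contains_block patternfile targetfile, it returns 0 for the example in the question. Like the scripts above, it holds both files in memory, and an empty first file is not handled specially.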

Ed Morton
  • Both your suggestions return with exit status 0 whatever the match is. I need to be able to check the result in a bash/makefile. – thoni56 Jul 13 '19 at 20:52
  • That is correct. You should have stated that in your question of course but it's an absolutely trivial tweak to change - let me know if you have any trouble implementing that and if you do need help then make sure to edit your question to include all relevant information, including if you're trying to do a string or regexp match, what should be printed to stdout/stderr, what the exit status should be, etc. – Ed Morton Jul 14 '19 at 04:03
  • It must be mentioned that the first case will fail if you have an EByte data file. ;-). Very nice RS solution. – kvantour Jul 14 '19 at 06:25
1

EDIT: Since the OP needs the outcome of the command as true or false (yes or no), I have edited the command accordingly (created and tested in GNU awk).

awk -v message="yes" 'FNR==NR{a[$0];next} ($0 in a){if((FNR-1)==prev){b[++k]=$0} else {delete b;k=""}} {prev=FNR}; END{if(length(b)>0){print message}}'  patternfile  targetfile


Could you please try the following? It was tested with the given samples and should print all consecutive lines from the pattern file if they appear in the same order in the target file (this code needs at least 2 consecutive lines).

awk '
FNR==NR{
  a[$0]
  next
}
($0 in a){
  if((FNR-1)==prev){
      b[++k]=$0
  }
  else{
      delete b
      k=""
  }
}
{
  prev=FNR
}
END{
  for(j=1;j<=k;j++){
      print b[j]
  }
}'  patternfile  targetfile

Explanation: Adding explanation for above code here.

awk '                                     ##Starting awk program here.
FNR==NR{                                  ##FNR==NR will be TRUE when first Input_file is being read.
  a[$0]                                   ##Creating an array a with index $0.
  next                                    ##next will skip all further statements from here.
}
($0 in a){                                ##Statements from here will be executed when 2nd Input_file is being read, checking if current line is present in array a.
  if((FNR-1)==prev){                      ##Checking condition if prev variable is equal to FNR-1 value then do following.
      b[++k]=$0                           ##Creating an array named b whose index is variable k whose value is increment by 1 each time it comes here.
  }
  else{                                   ##Mentioning else condition here.
      delete b                            ##Deleting array b here.
      k=""                                ##Nullifying k here.
  }
}
{
  prev=FNR                                ##Setting prev value as FNR value here.
}
END{                                      ##Starting END section of this awk program here.
  for(j=1;j<=k;j++){                      ##Starting a for loop here.
      print b[j]                          ##Printing value of array b whose index is variable j here.
  }
}'  patternfile  targetfile               ##mentioning Input_file names here.
RavinderSingh13
  • Awesome if you could explain how this solution works. – codeforester Jul 13 '19 at 07:35
  • @codeforester, have added explanation with solution now, cheers. – RavinderSingh13 Jul 13 '19 at 07:44
  • when changing last line of patternfile to i.e. `line 6`, this script will output `line3; line 4; line 6`. which is not desired output? – Luuk Jul 13 '19 at 07:49
  • @Luuk, I believe IMHO which should be desired output, since all are coming in same sequence, OP could confirm it. – RavinderSingh13 Jul 13 '19 at 07:52
  • In this particular case I'm only interested in the return code. zero for yes, anything else for no. The command could very well be completely silent, that's even preferred since I wouldn't have to pipe to /dev/null... – thoni56 Jul 13 '19 at 08:35
  • @thoni56, you mean to say if any 1 continuous series is found(at least one) then return yes or return no right? – RavinderSingh13 Jul 13 '19 at 08:38
  • Yes, if the content of the pattern file is found at least once in the target file, return 0 (as shell commands tend to do... – thoni56 Jul 13 '19 at 08:43
  • The return-code of your `awk` statement is still `0` (or `OK`) when there is no match. Normally the returncode can/should be checked doing `echo $?` after the statement, just returning 'yes' is not sufficient. my `one-liner` has the same problem, I will edit this in a moment... – Luuk Jul 13 '19 at 09:14
  • @Luuk, Np at all, I have created a variable named `message` which is as of now set to `yes`, OP could change it as per OP's need, cheers. – RavinderSingh13 Jul 13 '19 at 09:26
  • This will fail if the first line of pattern appears multiple times in pattern. – kvantour Jul 13 '19 at 10:00
1

a one-liner:

$ if [ $(diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 | tail -n +2 | head -n -1 | grep '(' | wc -l) == $(cat patternfile | wc -l) ]; then echo "ok"; else echo "error"; fi

explanation:

First, compare the two files using diff:

diff --left-column -y patternfile targetfile
                                      > line 1
                                      > line 2
line 3                                (
line 4                                (
line 5                                (
                                      > line 6  

Then filter to show only the interesting lines, which are the lines marked with '(', plus 1 extra line before and after the match, to check that the lines in patternfile match without a break.

diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 

                                      > line 2
line 3                                (
line 4                                (
line 5                                (
                                      > line 6  

Then leave out the first, and last line:

diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 | tail -n +2 | head -n -1

line 3                                (
line 4                                (
line 5                                (

Add some code to check whether the number of lines matches the number of lines in the patternfile:

if [ $(diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 | tail -n +2 | head -n -1 | grep '(' | wc -l) == $(cat patternfile | wc -l) ]; then echo "ok"; else echo "error"; fi

ok

to use this with a return-code, a script could be created like this:

#!/bin/bash
patternfile=$1
targetfile=$2
# Quote the file names so that paths containing spaces still work.
if [ $(diff --left-column -y "$patternfile" "$targetfile" | grep '(' -A1 -B1 | tail -n +2 | head -n -1 | grep '(' | wc -l) == $(cat "$patternfile" | wc -l) ];
then
   exit 0;
else
   exit 1;
fi

The test (when above script is named comparepatterns):

$ comparepatterns patternfile targetfile
$ echo $?
0
Luuk
1

another solution in awk:

echo $(awk 'FNR==NR{ a[$0]; next}{ x=($0 in a)?x+1:0 }x==length(a){ print "OK" }' patternfile targetfile ) 

This prints "OK" if there is a match; the exit status of the command itself is not affected.

Luuk
0

The easiest way to do this is to use a sliding window. First you read the pattern file, then the file to search.

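(FNR==NR) { a[FNR]=$0; n=FNR; next }   # first file: store the pattern lines
{ b[FNR]=$0 }                          # second file: buffer the current line
(FNR < n) { next }                     # window not yet full, nothing to compare
{ for(i=1; i<=n;++i) if (a[i] != b[FNR-n+i]) { delete b[FNR-n+1]; next}}
{ print "match at", FNR-n+1}
{ r=1}
END{ exit !r}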

which you call as

awk -f script.awk patternFile searchFile
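
A minimal sketch of the same sliding-window idea, wrapped in a shell function that prints nothing and only sets the exit status, could look like this. The function name file_has_lines and the variable found are hypothetical, and the sketch stops at the first match instead of reporting every one.

```shell
# file_has_lines PATTERNFILE SEARCHFILE   (hypothetical helper name)
# Exits 0 if the lines of PATTERNFILE appear consecutively in SEARCHFILE, else 1.
file_has_lines() {
    awk '
        FNR == NR { a[FNR] = $0; n = FNR; next }   # first file: remember the wanted lines
        { b[FNR] = $0; delete b[FNR - n] }         # keep only the last n lines of the second file
        FNR < n { next }                           # window not full yet
        {
            for (i = 1; i <= n; i++)
                # appending "" forces a string comparison, so "1.0" and "1" differ
                if ((a[i] "") != (b[FNR - n + i] "")) next
            found = 1                              # all n lines matched
            exit                                   # jump to END, skipping the rest of the input
        }
        END { exit !found }
    ' "$1" "$2"
}
```

Because only the last n lines are buffered, memory use depends on the pattern length rather than on the size of the searched file. An empty pattern file is not handled specially here and yields exit status 1.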
kvantour
0

Following up on a comment from Cyrus, who pointed to How to know if a text file is a subset of another, the following Python one-liner does the trick

python -c "content=open('content').read(); target=open('target').read(); exit(0 if content in target else 1);"
thoni56
0

Unless you're talking about 10 GB+, here's an awk-based solution that's fast and clean:

mawk '{ exit NF==NR }' RS='^$' FS="${multiline_pattern}"   
  • The pattern exists only in the file "${m2p}", which is part of the multi-file pipeline in the 1st test below but not in the 2nd one.

This solution does not yet handle cases where regex metacharacters in the pattern need escaping. Alter it as you see fit.

Unless the pattern occurs far too often, it might even save time to do it all at once instead of checking line by line and saving lines along the way in some temporary pattern space.

NR is always 1 here, since RS is set so that the whole input is read as a single record. NF is larger than 1 only when the pattern is found. Evaluating exit NF == NR therefore exits with 1 when there is no match and with 0 when there is one, which matches the usual shell exit-code convention.

% echo; ( time ( \
                  \
  echo "\n\n multi-line-pattern :: \n\n "      \
       "-------------\n${multiline_pattern}\n"  \
       " -----------\n\n "                       \
       "$( nice gcat "${m2m}" "${m3m}"    "${m3l}" "${m2p}" \
                     "${m3r}" "${m3supp}" "${m3t}" | pvE0    \
                                                   \
         | mawk2 '{ exit NF == NR
           }'            RS='^$'                    \
                         FS="${multiline_pattern}"   \
                          \
        ) exit code : ${?} " ) )  | ecp 

 
      in0: 3.10GiB 0:00:01 [2.89GiB/s] [2.89GiB/s] [ <=> ]
( echo ; )  0.77s user 1.74s system 110% cpu 2.281 total


 multi-line-pattern :: 

 -------------
77138=1159=M
77138=1196=M
77138=1251=M
77138=1252=M
77138=4951=M
77138=16740=M
77138=71501=M
 -----------

  exit code : 0 
 
% echo; ( time ( \
                  \
  echo "\n\n multi-line-pattern :: \n\n "      \
       "-------------\n${multiline_pattern}\n"  \
       " -----------\n\n "                       \
       "$( nice gcat "${m2m}" "${m3m}"    "${m3l}" \
                     "${m3r}" "${m3supp}" "${m3t}" | pvE0 \
                                                   \
         | mawk2 '{ exit NF == NR
           }'            RS='^$'                    \
                         FS="${multiline_pattern}"   \
                          \
        ) exit code : ${?} " ) )  | ecp 

 
      in0: 2.95GiB 0:00:01 [2.92GiB/s] [2.92GiB/s] [ <=> ]
( echo ; )  0.64s user 1.65s system 110% cpu 2.074 total


 multi-line-pattern :: 

 -------------
77138=1159=M
77138=1196=M
77138=1251=M
77138=1252=M
77138=4951=M
77138=16740=M
77138=71501=M
 -----------

  exit code : 1 
 

If your pattern is the full file, then something like this works: even when using the full file as a single gigantic 153 MB pattern, it finished in less than 2.4 secs against ~3 GB of input.

echo 
( time ( nice gcat "${m2m}" "${m3m}" "${m3l}" "${m3r}" "${m3supp}" "${m3t}" | pvE0 \
  \
  | mawk2              -v pattern_file="${m2p}" '
    BEGIN { 
                     RS = "^$"
             getline FS < pattern_file
                    close(pattern_file) 
    } END {  
             exit NF == NR }' ; echo "\n\n exit code :: $?\n\n" ))|ecp; 

du -csh "${m2p}" ; 
( time (  nice gcat "${m2m}" "${m3m}" "${m3l}" \
        "${m2p}" "${m3r}" "${m3supp}" "${m3t}" | pvE0 \
   \
   | mawk2             -v pattern_file="${m2p}" '
     BEGIN { 
                     RS = "^$"
             getline FS < pattern_file
                    close(pattern_file) 
     } END  {  
             exit NF == NR }' ; echo "\n\n exit code :: $?\n\n" ))|ecp; 

 
      in0: 2.95GiB 0:00:01 [2.58GiB/s] [2.58GiB/s] [ <=> ]
( nice gcat "${m2m}" "${m3m}" "${m3l}" "${m3r}" "${m3supp}" "${m3t}" | pvE 0.)  

 0.82s user 1.71s system 111% cpu 2.260 total


 exit code :: 1


 
153M    /Users/************/m2map_main.txt
153M    total
 
      in0: 3.10GiB 0:00:01 [2.56GiB/s] [2.56GiB/s] [ <=> ]
( nice gcat "${m2m}" "${m3m}" "${m3l}" "${m2p}" "${m3r}" "${m3supp}" "${m3t}")

 0.83s user 1.79s system 112% cpu 2.339 total


 exit code :: 0
RARE Kpop Manifesto
0

Found a portable solution using the patch command. The idea is to create a diff/patch in the remove direction and check whether it can be applied to the source file. Sadly there is no dry-run option (in my old patch version), so we have to apply the patch and remove the temporary files afterwards.

The surrounding shell code is written for my ksh usage:

file_in_file() {
        typeset -r vtmp=/tmp/${0%.sh}.$$.tmp
        typeset -r vbasefile=$1
        typeset -r vcheckfile=$2
        typeset -ir vlines=$(wc -l < "$vcheckfile")
        { echo "1,${vlines}d0"; sed 's/^/< /' "$vcheckfile"; } |
        patch -fns -F0 -o "$vtmp" "$vbasefile" >/dev/null 2>&1
        typeset -ir vrc=$?
        rm -f "$vtmp"*
        return $vrc
}

Explanation:

  1. set variables for local usage (on newer bash you should use declare instead)
  2. count lines of input file
  3. create a patch/diff file in-memory (the line with the curly brackets)
  4. use patch with strict settings patch -F0
  5. cleanup (including any reject files that may have been created: rm -f "$vtmp"*)
  6. return RC of patch
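
To make step 3 concrete, here is a small sketch that only prints the normal-format diff the function pipes into patch, without running patch itself. The function name make_remove_patch is hypothetical and not part of the function above.

```shell
# make_remove_patch FILE   (hypothetical helper name)
# Prints a normal-format diff that deletes the lines of FILE,
# i.e. the same text file_in_file builds between the curly brackets.
make_remove_patch() {
    n=$(( $(wc -l < "$1") ))    # arithmetic expansion strips any padding in the wc output
    echo "1,${n}d0"             # header: delete old lines 1..n
    sed 's/^/< /' "$1"          # each line to delete, prefixed with "< "
}
```

For a file holding the three lines "line 3", "line 4" and "line 5", this prints the header 1,3d0 followed by the three lines, each prefixed with "< ".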