How to check if one file is part of another?

Question

I need to check, from a bash script, whether one file is contained inside another file, for a given multiline pattern and input file.

Return value:

I want to receive an exit status like grep's: 0 if any matches were found, 1 if no matches were found.

Pattern:

multiline,
order of lines is important (treated as a single block of lines),
includes characters such as numbers, letters, ?, &, *, # etc.,

Explanation

Only the following examples should produce matches:

pattern     file1 file2 file3 file4
222         111   111   222   222
333         222   222   333   333
            333   333         444
            444

The following shouldn't:

pattern     file1 file2 file3 file4 file5 file6 file7
222         111   111   333   *222  111   111   222
333         *222  222   222   *333  222   222   
            333   333*        444   111         333
            444                     333   333

Here's my script:

#!/bin/bash

function writeToFile {
    if [ -w "$1" ] ; then
        echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}

function writeOnceToFile {
        pcregrep --color -M "$2" "$1"
        #echo $?

        if [ $? -eq 0 ]; then
            echo This file contains text that was added previously
        else
            writeToFile "$1" "$2"
        fi
}

file=file.txt 
#1?1
#2?2
#3?3
#4?4

pattern=`cat pattern.txt`
#2?2
#3?3

writeOnceToFile "$file" "$pattern"

I can use the grep command for each line of the pattern, but it fails with this example:

file.txt 
#1?1
#2?2
#=== added line
#3?3
#4?4

pattern.txt
#2?2
#3?3

or even if you swap lines 2 and 3:

file=file.txt 
#1?1
#3?3
#2?2
#4?4

returning 0 when it shouldn't.

How can I fix it? Note that I prefer to use programs that are installed by default (if possible, without pcregrep). Maybe sed or awk can solve this problem?

Are you trying to find out if any given line exists in the file already or if the entire set of new lines exists as a single block of lines in the file already? — Etan Reisner, Jul 21 '15 at 14:08
I want to check whether the full pattern (as a single block of lines) exists in the input file. — abrzozowski, Jul 21 '15 at 14:13
You might want to update your question to make it clear earlier that this is subtly different from a substring match ignoring newlines. Because, as your `...\n*222\n333\n...` non-matching case shows, you need the pattern block to match starting at the beginning of a line and ending at the end of a line. — Peter Cordes, Jul 22 '15 at 05:10

fedorqui · Answer 1 · 2015-07-21T14:50:14.483

7

I would just use diff for this task:

diff pattern <(grep -f file pattern)

Explanation

diff file1 file2 reports if two files differ or not.
grep -f file pattern uses each line of file as a search pattern and prints the lines of pattern that match, so you see what content of pattern is in file.

So what you are doing is to check what lines from pattern are in file and then comparing this to pattern itself. If they match, it means that pattern is a subset of file!
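As a minimal sketch, the same check can be wrapped in a function that returns a grep-like exit status. The function name lines_all_present is hypothetical, and the -F and -x flags are an addition that is not in the command above: they make grep compare whole lines as fixed strings, so characters such as ?, & and * are not treated as regular-expression syntax. The grep output is piped to diff (which reads it as "-") instead of using process substitution. Like the original command, this only tests whether every pattern line occurs somewhere in the file; it does not look at line order.

```bash
#!/bin/bash
# lines_all_present PATTERN_FILE FILE   (hypothetical helper name)
# Exit status 0 if every line of PATTERN_FILE also occurs as a whole line
# somewhere in FILE, non-zero otherwise. Line order is NOT checked.
lines_all_present() {
    # grep -F: treat the lines of FILE as fixed strings, not regexes
    # grep -x: only whole-line matches count
    # diff -q FILE - : compare the pattern file with grep output on stdin
    grep -Fx -f "$2" "$1" | diff -q "$1" - > /dev/null
}
```

Usage would look like `lines_all_present pattern.txt file.txt && echo "all lines present"`.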

Tests

seq 10 is part of seq 20! Let's check it:

$ diff <(seq 10) <(grep -f <(seq 20) <(seq 10))
$

seq 10 is not exactly inside seq 2 20 (1 is not in the second one):

$ diff -q <(seq 10) <(grep -f <(seq 2 20) <(seq 10))
Files /dev/fd/63 and /dev/fd/62 differ

edited Jul 21 '15 at 14:50

answered Jul 21 '15 at 14:11

fedorqui


It works nicely! But it doesn't pay attention to the order of the lines (in pattern), and there may be additional characters at the beginning and end of each line. It's almost what I wanted to achieve. – abrzozowski Jul 21 '15 at 14:53
@user51390233 so is the order important? Better update your question with some more relevant sample input. Otherwise my answer is just guessing. – fedorqui Jul 21 '15 at 14:56
@user51390233 I checked my snippet against the files you mention and it works for all cases. – fedorqui Jul 21 '15 at 16:07
You're right. Adding characters at the beginning or end of a line does not pass (**return 1**), but the order of lines is not respected (**return 0** for the sample pattern "222\n333" and file "333\n222"). – abrzozowski Jul 21 '15 at 17:14
Changing the order of the arguments of your command to `diff pattern <(grep -f pattern file)` protects against wrong line order, but doesn't deal with an extra line in the input file, for example pattern "222\n333" and file "222\n555\n333". – abrzozowski Jul 21 '15 at 18:41
I have an ugly command that protects from line injections: " diff -q pattern <(grep -f pattern file && if [ $((\`grep -n -f pattern file | tail -1 | head -c1\` - \`grep -n -f pattern file | head -c1\` + 1)) -ne \`cat pattern | wc -l\` ]; then echo "diff"; fi) " – abrzozowski Jul 21 '15 at 23:07
@user51390233 I added a completely different approach in another answer. It uses `awk` to handle things better. I think now we got the correct solution! – fedorqui Jul 22 '15 at 11:08

score 3 · Answer 2 · answered Jul 22 '15 at 11:08

I went through the problem again and I think awk can handle this better:

awk 'FNR==NR {a[FNR]=$0; next}
     FNR==1 && NR>1 {for (i in a) len++}
     {for (i=last; i<=len; i++) {
         if (a[i]==$0) 
            {last=i; next}
     } status=1}
     END {print status+0}' file pattern

The idea is:

- Read all the file file in memory in an array a[line_number] = line.
- Count the elements in the array.
- Loop through the file pattern and check if the current line occurs in file anytime between where the cursor is and the end of the file file. If it matches, move the cursor to the position where it was found. If it did not, set the status to 1 - that is, there is a line in pattern that did not occur in file after the previous match.
- Print the status, that will be 0 unless it was set to 1 anytime before.
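Since the question asks for an exit status rather than printed output, here is a minimal sketch of the same awk program wrapped in a shell function. The only change to the awk code is that the END block calls exit instead of print; the function name lines_in_order is hypothetical, and the argument order (file first, pattern second) follows the command above.

```bash
#!/bin/bash
# lines_in_order FILE PATTERN_FILE   (hypothetical helper name)
# Exit status 0 if the lines of PATTERN_FILE occur in FILE in the same
# order, 1 otherwise. Same logic as the awk program above.
lines_in_order() {
    awk 'FNR==NR {a[FNR]=$0; next}             # first file: store its lines
         FNR==1 && NR>1 {for (i in a) len++}   # second file starts: count them
         {for (i=last; i<=len; i++) {          # scan forward from the cursor
             if (a[i]==$0)
                {last=i; next}                 # found: move cursor, next line
         } status=1}                           # not found after the cursor
         END {exit status+0}' "$1" "$2"
}
```

Usage would look like `lines_in_order file pattern && echo "found"`. Like the original, this scans forward from the cursor, so it also accepts a file in which other lines sit between the matched ones (for example 222, 555, 333 for the pattern 222, 333).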

Test

They do match:

$ tail f p
==> f <==
222
333
555

==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
0

They don't:

$ tail f p
==> f <==
333
222
555

==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
1

With seq:

$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 2 20) <(seq 10)
1
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 20) <(seq 10)
0

answered Jul 22 '15 at 11:08

fedorqui


Thank you, that has been very helpful. I can work with that. But it isn't line-injection-proof, which means that someone can add a line to the input file (inside the pattern occurrence). It will probably need to read the pattern file many times (but I don't know how; close() doesn't work for me), or load it into a variable like **file**. – abrzozowski Jul 22 '15 at 15:53
Mmmm not sure what you mean. I guess you are asking this as a part of a bigger problem. Maybe clarify that so I can address my answer better. Showing some counter-example could also help. By the way, it is an interesting problem to work on : ) – fedorqui Jul 22 '15 at 15:55
@user51390233: this awk script only reads the target file once (in the `FNR == NR {next}` block. The `next` skips checking the other blocks for that line). Then it goes through the pattern file once, with the other two blocks. – Peter Cordes Jul 22 '15 at 16:09
@fedorqui: the 2nd block that counts the length of `a[i]` can be replaced with a `len = length(a)`. (GNU extension). For POSIX, `len = FNR` in the first block, which won't execute for the 2nd file. – Peter Cordes Jul 22 '15 at 16:14
@fedorqui: I think this allows matches with all lines in order, but with other lines mixed in. I think what the OP wants is a `strstr` or `memmem` style search, anchored with newlines, not really a line-based thing at all. Like I tried to do in my answer. – Peter Cordes Jul 22 '15 at 16:19
@fedorqui sorry for my unclear answers. To explain "line injection-proof": 'if (a[i]==$0){last=i; next}' *checks if the current line occurs ... between where the cursor is and the end of file*, but it shouldn't. Step by step: 1. find the first line (**p[1]**, the first line of the pattern) in 'file' (e.g. found at **f[N]**) 2. check if **f[N+1]** equals **p[2]** (if yes, go to step 3; if not, go back to step 1 and check again starting from the first line of the pattern / **here I wanted to reread the pattern using close**) 3. check if **f[N+2]** equals **p[3]** (If yes ... – abrzozowski Jul 22 '15 at 22:35
@PeterCordes very interesting comments, thanks for them! – fedorqui Jul 23 '15 at 18:32
@user51390233: are you saying his awk program doesn't backtrack properly if a potential match is only rejected after a couple lines? I worried about that, too, since it doesn't seem to be saving the pattern anywhere. IDK, I didn't fully follow the logic, since I figured a real text-search string-in-string function would perform better than a line-based version with manually-implemented backtracking. – Peter Cordes Jul 23 '15 at 19:40
@Peter Cordes yes, this script doesn't even try to backtrack (in the original version). When I tried to add backtracking with the close function, it didn't work. I don't know why. I think you are right that the better approach is a string-in-string text search. So I have to start learning Perl. – abrzozowski Jul 23 '15 at 21:56
@fedorqui I found another problem with this script. If the file is empty, it always returns 0 (matches found). To fix it, I suggest changing the END block to ` END { if (FNR==NR) { status=1 } print status+0; } ` – abrzozowski Jul 23 '15 at 21:56
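
The step-by-step check described in the comments above (find the first pattern line, require each following file line to equal the next pattern line, and restart from the next file line otherwise) can be sketched in awk inside a shell function. This is a minimal sketch, not code from either answer: the function name block_in_file and the argument order (pattern first, target second) are assumptions made for illustration, and it keeps both files in memory. Lines are compared as strings, so 222 and 222.0 are treated as different.

```bash
#!/bin/bash
# block_in_file PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status 0 if the lines of PATTERN_FILE occur in TARGET_FILE as one
# contiguous block of whole lines, 1 otherwise.
block_in_file() {
    awk '
        # First file (the pattern): store its lines in p[1..plen].
        FNR == NR { p[++plen] = $0; next }
        # Second file (the target): store its lines in f[1..flen].
        { f[++flen] = $0 }
        END {
            # Assumption: an empty pattern is reported as "no match".
            if (plen == 0) exit 1
            # Try every possible starting line in the target.
            for (start = 1; start + plen - 1 <= flen; start++) {
                # Appending "" forces a string comparison.
                for (i = 1; i <= plen && (f[start + i - 1] "") == (p[i] ""); i++)
                    ;
                # All pattern lines matched consecutively.
                if (i > plen) exit 0
            }
            exit 1
        }
    ' "$1" "$2"
}
```

Usage would look like `block_in_file pattern.txt file.txt || echo "block not found"`.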

score 2 · Accepted Answer · edited May 23 '17 at 12:24

I have a working version using perl.

I thought I had it working with GNU awk, but I didn't. RS=empty string splits on blank lines. See the edit history for the broken awk version.

"How can I search for a multiline pattern in a file?" shows how to use pcregrep, but I can't see a way to get it to work when the pattern to search may contain regex special characters. -F fixed-string mode doesn't usefully work with multi-line mode: it still treats the pattern as a set of lines to be matched separately. (Not as a multi-line fixed-string to be matched.) I see you were already using pcregrep in your attempt.

BTW, I think you have a bug in your code in the non-sudo case:

function writeToFile {
    if [ -w "$1" ] ; then
        "$2" >> "$1"   # probably you mean  echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}

Anyway, attempts at using line-based tools have met with failure, so it's time to pull out a more serious programming language that doesn't force the newline convention on us. Just read both files into variables, and use a non-regex search:

#!/usr/bin/perl -w
# multi_line_match.pl  pattern_file  target_file
# exit(0) if a match is found, else exit(1)

#use IO::File;
use File::Slurp;
my $pat = read_file($ARGV[0]);
my $target = read_file($ARGV[1]);

if ((substr($target, 0, length($pat)) eq $pat) or index($target, "\n".$pat) >= 0) {
    exit(0);
}
exit(1);

See "What is the best way to slurp a file into a string in Perl?" to avoid the dependency on File::Slurp (which isn't part of the standard perl distro, or a default Ubuntu 15.04 system). I went for File::Slurp partly for readability of what the program is doing, for non-perl-geeks, compared to:

my $contents = do { local(@ARGV, $/) = $file; <> };

I was working on avoiding reading the full file into memory, with an idea from http://www.perlmonks.org/?node_id=98208. I think non-matching cases would usually still read the whole file at once. Also, the logic was pretty complex for handling a match at the front of the file, and I didn't want to spend a long time testing to make sure it was correct for all cases. Here's what I had before giving up:

#IO::File->input_record_separator($pat);
$/ = $pat;  # pat must include a trailing newline if you want it to match one

my $fh = IO::File->new($ARGV[2], O_RDONLY)
    or die 'Could not open file ', $ARGV[2], ": $!";

$tail = substr($fh->getline, -1);  #fast forward to the first match
#print each occurence in the file
#print IO::File->input_record_separator  while $fh->getline;

#FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
do {
    # fixme: need to check defined($fh->getline)
    if (($tail eq '\n') or ($tail = substr($fh->getline, -1))) {
    exit(0);  # if there's a 2nd line
    }
} while($tail);

exit(1);
$fh->close;

Another idea was to filter patterns and files to be searched through tr '\n' '\r' or something, so they would all be single-lines. (\r being a likely safe choice that wouldn't collide with anything already in a file or a pattern.)
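
A minimal sketch of that last idea, assuming neither file already contains carriage returns (so CRLF files would need converting first) and that the target file ends with a newline. The function name block_in_file_cr is hypothetical. Both inputs are flattened with tr, and the flattened pattern is then searched for as a single fixed string with grep -F. A leading \r is added on both sides so the block can only match at the start of a line, and the trailing \r makes it end at the end of a line.

```bash
#!/bin/bash
# block_in_file_cr PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status 0 if PATTERN_FILE occurs in TARGET_FILE as a block of whole
# lines, 1 otherwise. Assumes no carriage returns in either file.
block_in_file_cr() {
    local cr pat
    cr=$(printf '\r')
    # Flatten the pattern: every newline becomes a carriage return.
    # Command substitution strips trailing newlines only, so the final \r stays.
    pat=$(tr '\n' '\r' < "$1")
    # Assumption: an empty pattern is reported as "no match".
    [ -n "$pat" ] || return 1
    # If the pattern file lacked a final newline, add the closing \r anyway
    # so the block still has to end at the end of a line.
    case $pat in
        *"$cr") ;;
        *) pat=$pat$cr ;;
    esac
    # Prefix the flattened target with \r so a block at the very start of
    # the file is also preceded by a line boundary, then do a fixed-string
    # search for \r + pattern.
    { printf '\r'; tr '\n' '\r' < "$2"; } | grep -qF -e "$cr$pat"
}
```

Usage would look like `block_in_file_cr pattern.txt file.txt && echo "block found"`.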

Thank you, this is a very useful response for me. I saw that awk is pretty powerful. But it works only for a **target filename** without blank lines (setting **RS=""** reads the target file only up to the first blank line). I don't know how to improve it. — abrzozowski, Jul 22 '15 at 15:53
oh bother, I misread the man page. You're right, RS="" still splits on blank lines, so it's not a slurp. Probably perl is the way to go, then. — Peter Cordes, Jul 22 '15 at 16:02
This perl version should work. I tested it on a couple inputs, with a pattern of "222\n\n333\n". The logic is dead simple, and doesn't do anything that treats any input as lines, just characters. — Peter Cordes, Jul 22 '15 at 18:23
Your perl script looks good. But I am new to perl, so I run into basic problems that I can't handle, e.g. IO::Handle: bad open mode: O_RDONLY. I will take a moment to get familiar with this language. I like the approach of treating the input as characters; it's a good way. — abrzozowski, Jul 23 '15 at 21:57
from `read_file`? You're on GNU/Linux with File::Slurp installed? I didn't directly use `IO::File` or `IO::Handle` at all in my code. My code block that starts with `#!/usr/bin/perl -w` is a complete perl program that solves the problem. You're not trying one of the programs from http://www.perlmonks.org/?node_id=98208 are you? — Peter Cordes, Jul 24 '15 at 00:55
Yeah, after a long time (learning perl), I ran it and it works. Thank you! — abrzozowski, Aug 03 '15 at 08:30
