1

How can I find a pattern in one file that doesn't match any line of another file

I'm aware that grep has a -f option, so instead of feeding grep a pattern, I can feed it a file of patterns.

(a.a is my main file)

user@system:~/test# cat a.a
Were Alexander-ZBn1gozZoEM.mp4
Will Ate-vP-2ahd8pHY.mp4

(p.p is my file of patterns)

user@system:~/test# cat p.p
ZBn1gozZoEM
0maL4cQ8zuU
vP-2ahd8pHY

So the command might be something like

somekindofgrep p.p a.a

but it should give 0maL4cQ8zuU which is the pattern in the file of patterns, p.p, that doesn't match anything in the file a.a

I am not sure what command to use.

$grep -f p.p a.a<ENTER>
Were Alexander-ZBn1gozZoEM.mp4
Will Ate-vP-2ahd8pHY.mp4
$

I know that if there were an additional line in a.a not matched by any pattern in p.p, then grep -f p.p a.a wouldn't show it. And if I do grep -v -f p.p a.a, then it'd show only that line of a.a, the one not matched by anything in p.p.

But I'm interested in finding which pattern in p.p (my file of patterns) doesn't match any line of a.a!

I looked at "Make grep print missing queries", but that asker wants everything from both files. Also, one of the answers there mentions -v, but I can't see that applying to my case, because -v shows the lines of a file that don't match any pattern. So having or not having -v won't help me, because I'm looking for a pattern that doesn't match any line of a file.

Dudi Boy
barlop
  • IMO, your requirement (finding a **pattern** which does not match any line, as opposed to finding a **line** which does not match any pattern) is sufficiently unusual that `grep` would not offer any built-in support for it. I would therefore implement a manual solution, for instance looping over all patterns and checking each one individually. – user1934428 May 27 '22 at 06:05
  • Please read [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern) and then replace the word "pattern" with whatever it is you're trying to match with/on (at least regexp-or-string + line-or-word + full-or-partial) throughout your question. – Ed Morton May 27 '22 at 22:16
  • @EdMorton I know grep and regular expressions, though am not familiar with awk .. and I have done search and replace with regular expressions and grep, and powergrep and some perl one liners. I'm not entirely clear what you're getting at. I don't know what "whatever it is you're trying to match with" means. I know what a pattern is, and what "the data file/stream" is. If you're asking me to replace the pattern with the data stream? The pattern is a pattern!! – barlop May 27 '22 at 22:34
  • The point is that no-one knows what a "pattern" is because it could mean any of a dozen or more different things. Please read the article I referenced as it explains why the term "pattern" shouldn't be used when matching text. – Ed Morton May 27 '22 at 22:50
  • For example you say `How can I find a pattern in one file ...` - should that "pattern" be a regexp match or a string match (e.g. does `t.e` match `the` or not?)? Do you want to match across a whole line or whole word or partial line or partial word? If it's a word - how do you define a "word"? Should the match be case sensitive or not? And so on. – Ed Morton May 27 '22 at 22:53
  • @EdMorton Those objections are irrelevant because they are trivial to adapt. eg if multiline then could swap grep for a command line perl one liner equivalent as perl has a multiline switch, or an alternative to grep called pcregrep mentioned here https://stackoverflow.com/questions/2686147/how-to-find-patterns-across-multiple-lines-using-grep . If "word" boundaries matter then it should be a regex with `\b` rather than a string, if miscellaneous definition of word then a regex specifying whatever white space is wanted. Case sensitivity is a switch, trivial. – barlop May 29 '22 at 06:23
  • @EdMorton so the heart of the question is what the first commenter, "user1934428" picked up on, which is in the question title, and in the question body. – barlop May 29 '22 at 06:24
  • What I posted aren't objections, they're suggestions on how you could improve your question to get the best answer. The idea that you can get solution X for requirements A and then tweak it to work for actual requirements B and still end up with the best solution for requirements B is simply wrong and is often how you end up with Frankencode. – Ed Morton May 29 '22 at 12:35
  • Your question currently reads like you went to an automotive dealership and would only tell them you wanted to buy a "vehicle" and wouldn't tell them if you wanted to buy a car or a truck or a motorbike, whether you want gas or electric or hybrid, etc. Just like "vehicle" in that context, "pattern" is too general a term for us to help you get the best of whatever it is you want. Good luck. – Ed Morton May 29 '22 at 12:39
  • @EdMorton Well it already got answers so clearly they haven't had the issue that you think there is. I will test them when I get back. – barlop May 29 '22 at 12:46
  • Of course it got answers, such questions always get answers as anyone can take a guess at what you might want (especially if they have their own concept of what a "pattern" means to them) and provide a script that does that just like any car salesman can suggest a "vehicle". The answers may even do what you want for some sample input you test it with. That doesn't mean they'll do what you want robustly, efficiently, and/or portably in general, i.e. they may not be the best answer for whatever it is you really are trying to do. – Ed Morton May 29 '22 at 12:56
  • It's also doing a regexp match when, just guessing given the sample input/output you provided, I THINK what you will eventually tell us you need is a string match of some kind. – Ed Morton May 29 '22 at 13:09
  • @EdMorton well, a string match, e.g. grep -F, is fine, though for my data a regex is fine too, because none of my patterns contain anything like a dot `.`, a backslash `\`, or a square bracket `[` that'd have special meaning in a regex – barlop May 29 '22 at 14:03
  • Now we're getting somewhere. In any context, if a string match is "fine" then do a string match since regexp matches are slower (and more susceptible to breakage if/when input contents change) than string matches. That's why `fgrep` and `grep -F`, which do string matches use the character `f` for "fast" (any docs you see saying the "f" stands for "fixed-whatever strings" are retroactively trying to give it that meaning). – Ed Morton May 29 '22 at 14:12
  • Now - what about full vs partial vs anything else? Your strings in `p.p` all appear to always match the strings between the first `-` and the last `.` of the strings in `a.a`. Am I right in thinking that's the part of the `a.a` string that has to match the full string from `p.p`? For example, if `Johnny ZBn1gozZoEM-abcdefghijk.mp4` existed in `a.a`, should that match `ZBn1gozZoEM` from `p.p` or not? – Ed Morton May 29 '22 at 14:14
  • What if `a.a` contained `Johnny foovP-2ahd8pHYxxx.mp4` - should that match `vP-2ahd8pHY` from `p.p` or not? Are the strings in `p.p` all the same length as each other and is that always the same as the number of chars between the first `-` and the last `.` in `a.a`? e.g. could `a.a` contain `Were Alexander-ZBn1gozZoEMfoobar.mp4` and, if so, should that still match `ZBn1gozZoEM` from `p.p`? – Ed Morton May 29 '22 at 14:19
  • @EdMorton really that pattern string e.g. vP-2ahd8pHY or whatever, is not going to occur more than once on a line (and it would have a dot after it and a dash before it, when it occurs, but if the code matched cases where it doesn't have a dot after it and a dash before it, e.g. your foovP-2ahd8pHYxxx.mp4 example, then it wouldn't matter if you matched it or not, 'cos in my data that wouldn't happen). So if you just matched that fixed string then it'd be fine – barlop May 29 '22 at 14:26
  • @EdMorton also, i'm using the term pattern, even if or even though it's a fixed string, 'cos e.g. I see from an example here https://stackoverflow.com/questions/2502354/what-is-pattern-matching-in-functional-languages that even a case statement is considered to be pattern matching so I suppose matching a fixed string against some data is still considered to be pattern matching? (though granted you are keen on knowing that it is or can be or should be solved with a fixed string) – barlop May 29 '22 at 14:30
  • regexp, string, globbing, full, partial, word, line, etc., etc. are all related to "pattern matching" just like cars, trucks, motorcycles, planes, trains, boats, etc. are all "vehicles". That does not mean you can tell someone you want to "match a pattern" or "buy a vehicle" and they'll be able to accurately help you find the right type of "pattern matching" or "vehicle" for you. It sounds like you're just happy with anything you get regardless of robustness, portability, efficiency, clarity, etc. so - all the best! – Ed Morton May 29 '22 at 14:41
  • @EdMorton I am indeed, though I or anybody viewing answers to this question, would hold in high regard a robust portable efficient clear solution! – barlop May 29 '22 at 15:10
  • OK, I posted an answer based on what I think you're really trying to do. – Ed Morton May 29 '22 at 15:19

4 Answers

4

I suggest an awk script that scans a.a only once:

script.awk

FNR==NR{wordsArr[$0] = 1; next} # read patterns list from 1st file into array wordsArr
{ # for each line in 2nd file
  for (i in wordsArr){ # iterate over all patterns in array
    if ($0 ~ i) delete wordsArr[i]; # if pattern is matched to current line remove the pattern from array
  }
}
END {for (i in wordsArr) print "Unmatched: " i} # print all patterns left in wordsArr

running: script.awk

awk -f script.awk p.p a.a

Testing:

p.p

aa
bb
cc
dd
ee

a.a

ddd
eee
ggg
fff
aaa

test:

awk -f script.awk p.p a.a
Unmatched: bb
Unmatched: cc
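
Note that `$0 ~ i` treats each line of p.p as a regular expression. If the lines of p.p are meant as literal strings (as the comments under the question suggest), a hedged variant of the same single-pass idea could use `index()` instead. This is only a sketch: the scratch directory and the names `workdir`, `strings` and `unmatched` are illustrative, and the sample data is copied from the question.

```shell
# Recreate the question's sample files in a scratch directory (illustrative only).
workdir=$(mktemp -d)
printf '%s\n' 'Were Alexander-ZBn1gozZoEM.mp4' 'Will Ate-vP-2ahd8pHY.mp4' > "$workdir/a.a"
printf '%s\n' 'ZBn1gozZoEM' '0maL4cQ8zuU' 'vP-2ahd8pHY' > "$workdir/p.p"

# index() does a literal substring test, so characters such as . [ or \
# in p.p are not interpreted as regex syntax.
unmatched=$(awk '
    FNR == NR { strings[$0]; next }        # 1st file: remember every search string
    {
        for (s in strings)                 # 2nd file: drop strings found in this line
            if (index($0, s)) delete strings[s]
    }
    END { for (s in strings) print s }     # whatever is left never matched
' "$workdir/p.p" "$workdir/a.a")

printf '%s\n' "$unmatched"
```

With the question's data this prints `0maL4cQ8zuU`.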
Dudi Boy
3

Home made script:

#!/bin/bash

if [[ $# -eq 2 ]]
then
    patterns="$1"
    mainfile="$2"

    if [[ ! -f "$patterns" ]]
    then
        echo "ERROR: file $patterns does not exist."
        exit 1
    fi
    if [[ ! -f "$mainfile" ]]
    then
        echo "ERROR: file $mainfile does not exist."
        exit 1
    fi
else
    echo "Usage: $0 <PATTERNS FILE> <MAIN FILE>"
    exit 1
fi

while IFS= read -r pattern
do
    # Test grep's exit status directly; a command cannot be run inside [[ ... ]].
    if ! grep -q -- "$pattern" "$mainfile"
    then
        echo "$pattern"
    fi
done < "$patterns"

As user1934428 suggested, this script loops over the patterns in file p.p and prints any pattern that is not found in file a.a.
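
For comparison, roughly the same result can be sketched without a shell loop by chaining two grep calls. This assumes GNU grep (`-o` is not in POSIX) and that the lines of p.p are literal strings; the scratch directory and the names `workdir` and `not_found` are illustrative. It may miss a string that only ever occurs inside a longer string from p.p, because `-o` reports one match per position.

```shell
# Recreate the question's sample files in a scratch directory (illustrative only).
workdir=$(mktemp -d)
printf '%s\n' 'Were Alexander-ZBn1gozZoEM.mp4' 'Will Ate-vP-2ahd8pHY.mp4' > "$workdir/a.a"
printf '%s\n' 'ZBn1gozZoEM' '0maL4cQ8zuU' 'vP-2ahd8pHY' > "$workdir/p.p"

# 1. -o prints only the parts of a.a that matched some string from p.p (-F: literal strings).
# 2. sort -u removes duplicate hits.
# 3. -v -x keeps the lines of p.p that are not exactly one of those hits.
not_found=$(grep -oFf "$workdir/p.p" "$workdir/a.a" | sort -u | grep -vxFf - "$workdir/p.p")

printf '%s\n' "$not_found"
```

With the question's data this prints `0maL4cQ8zuU`.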

Nic3500
  • According to Ed, processing a file in a shell loop is "a well known anti pattern" https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice what do you think? I guess it could be slower than other methods that use commands that don't do so.. (e.g. a command that 'uses C code' to process the file)? – barlop May 29 '22 at 13:08
  • @barlop you don't need to write a C program for this, Unix tools are very efficient at doing this kind of task. – Ed Morton May 29 '22 at 13:11
  • @EdMorton Well, shell tools are probably mostly written in C. (I'm not suggesting a new C program be written.) – barlop May 29 '22 at 13:13
  • @barlop that's completely irrelevant. So are most shells. – Ed Morton May 29 '22 at 13:13
  • @EdMorton It's very relevant. As long as the reading of the file is not done in a shell loop, and is done by code compiled from C, the speed is much better. That's, I think, what your link is getting at re speed. – barlop May 29 '22 at 13:14
  • The shell is written in C. The shell read command is written in C. Using tools that are written in C or not is not the issue. Using a tool that's designed to manipulate text vs using one that isn't is the issue. No, the link I provided is just saying to use a tool that's designed to manipulate text rather than using a shell since it is not, nothing to do with C. – Ed Morton May 29 '22 at 13:17
  • The link also mentions a possible security issue of processing a file in a shell script, I suppose it's talking about the issue of user input in a script file, somebody could add their own code to the script with an injection attack like this example on a windows batch file https://stackoverflow.com/questions/8254286/is-this-batch-file-injection though the file is provided by me so there's no code in the data file! – barlop May 29 '22 at 13:18
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/245146/discussion-between-barlop-and-ed-morton). – barlop May 29 '22 at 13:19
  • the link does go into specifics it doesn't just say e.g. script files are only designed for executing commands in batches and the specifics it goes into, are speed, and security and those could be put down to symptoms of oh script files weren't really designed for that sort of thing. – barlop May 29 '22 at 13:22
  • @EdMorton well, are dudi's and ufo's answers better re speed and security / those criticisms your link mentions re the anti pattern of a shell loop over a file? – barlop May 29 '22 at 13:25
  • Why not test them and see? Testing the speed should be trivial. Having said that, I don't actually think any of the answers do what you really need so how quickly they do it may be a moot point. That link doesn't just mention speed and security, by the way, it also discusses ease of writing robust code, legibility, and portability. – Ed Morton May 29 '22 at 13:34
  • To be clear, I could be completely wrong about the scripts not doing what you want - I simply don't know what you want yet (see [my first comment](https://stackoverflow.com/questions/72400886/in-bash-how-can-i-find-a-pattern-in-one-file-that-doesnt-match-any-line-of-ano/72421184?noredirect=1#comment127920461_72400886) under your question) and so maybe some or all of the answers do exactly what you want. I'm just guessing. You tell me all you need to match are "patterns" and that's enough information to come up with an answer - OK, well all of these answers match "patterns" so I guess they work? – Ed Morton May 29 '22 at 13:41
  • @EdMorton well, the one or two files I have are very small. I suppose in theory I could write some code to generate a large file to compare the speeds of the different solutions.. but this bash script is fine. – barlop May 29 '22 at 13:45
  • My 2 cents, I did not know how to do it in a "one liner" command, and suggested this. I based my loop processing on https://mywiki.wooledge.org/BashFAQ/001. In most similar cases, I do not care about execution speed, but I will have a look at the other answers. :) – Nic3500 May 30 '22 at 00:21
2
# Look up each line of p.p in a.a and print the pattern.
# (-I {} replaces the deprecated GNU-only -i option.)
# Patterns that match some line of a.a:
xargs -I {} sh -c 'grep -q "{}" a.a && echo "{}"' < p.p
# Patterns that match no line of a.a (what the question asks for):
xargs -I {} sh -c 'grep -q "{}" a.a || echo "{}"' < p.p
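
Because `{}` is substituted into the text of the `sh -c` script, a line of p.p containing quotes or `$(...)` would be interpreted by the shell. A hedged sketch of one way around that is to pass the line as a positional parameter instead, and to add `-F` so it is matched literally. The scratch directory and the names `workdir` and `missing` are illustrative, and xargs itself still interprets quotes and backslashes in its input.

```shell
# Recreate the question's sample files in a scratch directory (illustrative only).
workdir=$(mktemp -d)
printf '%s\n' 'Were Alexander-ZBn1gozZoEM.mp4' 'Will Ate-vP-2ahd8pHY.mp4' > "$workdir/a.a"
printf '%s\n' 'ZBn1gozZoEM' '0maL4cQ8zuU' 'vP-2ahd8pHY' > "$workdir/p.p"

# The trailing "sh {}" sets $0 to "sh" and $1 to the current line of p.p,
# so the line is data for the inner shell, never part of its script text.
missing=$(cd "$workdir" &&
    xargs -I {} sh -c 'grep -qF -- "$1" a.a || printf "%s\n" "$1"' sh {} < p.p)

printf '%s\n' "$missing"
```

With the question's data this prints `0maL4cQ8zuU`.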
ufopilot
1

Here's a possible solution based on one possible interpretation of what it is you're trying to do (a full-string match on the lines in p.p against the substrings between the first - and the last . in the lines in a.a):

$ awk '
    NR==FNR {
        sub(/[^-]*-/,"")
        sub(/\.[^.]*$/,"")
        file1[$0]
        next
    }
    !($0 in file1)
' a.a p.p
0maL4cQ8zuU

The above will work robustly, portably, and efficiently using any awk in any shell on every Unix box. It'll run orders of magnitude faster than the current shell loop answer, faster than the existing awk answer or the xargs answer, and will work no matter which characters exist in either file, regexp metachars included, and whether or not the search strings from p.p exist as substrings or in other contexts in a.a. It also will have zero security concerns no matter what is in the input files.
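
To see what the two `sub()` calls leave behind, here is a small sketch that applies them to the two sample lines of a.a from the question (the variable name `keys` is only illustrative):

```shell
keys=$(printf '%s\n' 'Were Alexander-ZBn1gozZoEM.mp4' 'Will Ate-vP-2ahd8pHY.mp4' |
awk '{
    sub(/[^-]*-/, "")      # drop everything up to and including the first "-"
    sub(/\.[^.]*$/, "")    # drop the last "." and whatever follows it
    print                  # what remains is the key looked up for each line of p.p
}')

printf '%s\n' "$keys"
```

This prints `ZBn1gozZoEM` and `vP-2ahd8pHY`; the second `-` in `vP-2ahd8pHY` survives because only the first `-` on the line ends the stripped prefix.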

Ed Morton