Note: I am assuming throughout this answer that the 2 individual log lines shown in the problem and repeated below do not contain newlines, either because they have been processed through the multiline codec plugin in Logstash or because the newlines have been removed in some other way.
TL;DR The Solution Using a Negative Lookbehind
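A negative lookbehind will work if it is followed by an appropriate anchor. Here the literal java right after the lookbehind acts as that anchor, so a line beginning with Caused by: cannot match. Looking at the two lines, this would work well: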
^(?<!Caused by: )java.*Exception
Note: it could just be
^(?<!Caused by: )j.*Exception
but I think writing out java makes it more readable.
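As a quick check, here is a minimal Python sketch of the pattern above, using the standard re module. The first sample line is the start of the string quoted in the question; the second (top_level) is a hypothetical line invented for illustration, and the helper name is_top_level_exception is my own. Python's regex flavour may differ in details from the one Logstash uses, so treat this as an illustration rather than a Logstash test.

```python
import re

# The pattern from the TL;DR above.
PATTERN = re.compile(r"^(?<!Caused by: )java.*Exception")

# First line of the stack trace quoted in the question.
caused_by = "Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/"
# Hypothetical top-level exception line, invented for this example.
top_level = "java.lang.IllegalStateException: example message"


def is_top_level_exception(line: str) -> bool:
    """Return True if the line starts with java and contains Exception."""
    return PATTERN.search(line) is not None


print(is_top_level_exception(caused_by))  # False
print(is_top_level_exception(top_level))  # True
```

The Caused by: line is rejected because java has to match at the very start of the string.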
Explanation of Problem with Sample Code
The problem with the given regular expressions
^(?<!Caused by: ).*?Exception
and
(?<!^Caused by: ).*?Exception
is the reluctant *? quantifier, which allows something to be matched 0 or more times. As explained in this answer, the regex engine starts at the beginning of the string and moves left to right. Because the quantifier is reluctant, it first tries to match nothing at all, but then the engine cannot match Exception, so it backtracks and lets . consume one more character before trying Exception again.
So the regex engine keeps trying to match one more character at a time (from left to right) until Exception is found after what it has consumed. Therefore the string
Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/
at com.upplication.s3fs.util.S3Utils.getS3ObjectSummary(S3Utils.java:55)
at com.upplication.s3fs.util.S3Utils.getS3FileAttributes(S3Utils.java:64)
does match, because the engine has consumed everything up to Exception, and Caused by: does not appear before the start of this match. Essentially, the .*? has consumed the very Caused by: that the negative lookbehind is looking for.
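To see this concretely, the following small Python sketch (standard re module; the variable names are mine) runs the first reluctant pattern against the first line of the quoted string and prints what was actually matched. It is only an illustration under the assumption that Python's engine behaves like the one in question for this pattern.

```python
import re

# The first reluctant pattern from the question.
reluctant = re.compile(r"^(?<!Caused by: ).*?Exception")

# First line of the quoted stack trace.
line = "Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/"

match = reluctant.search(line)
matched_text = match.group(0) if match else None
match_start = match.start() if match else None

print(matched_text)  # Caused by: java.nio.file.NoSuchFileException
print(match_start)   # 0
```

The match starts at position 0 and the matched text itself begins with Caused by:, so there is nothing before it for the lookbehind to reject.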
Understanding Deeper
To understand what the regex engine is actually doing with lookarounds I recommend viewing this answer.
I think it's easy to get caught out by quantifiers and lookarounds, and as a general rule I think lookarounds need to be anchored by something concrete (not .). To understand what I mean, let's look at a slight variation on the given regex with the greedy * quantifier. The regex
^(?<!Caused by: ).*Exception
also matches the quoted string.
The reason is that the greedy * quantifier starts by consuming the entire string and then backtracks from right to left, as explained in the first linked answer above. For the same reason (but from the other side), once the engine matches Exception it holds everything from the start of the string up to Exception. The lookbehind only inspects the text before the start of that match, does not find Caused by: there, and so the string matches successfully.
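A short Python sketch can make the greedy case visible. It compares the greedy pattern with a hypothetical variant of my own that simply drops the lookbehind; both are run against the first line of the quoted string using the standard re module.

```python
import re

# First line of the quoted stack trace.
line = "Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/"

# The greedy variation discussed above.
greedy_with_lookbehind = re.compile(r"^(?<!Caused by: ).*Exception")
# Hypothetical comparison pattern: the same regex without the lookbehind.
greedy_plain = re.compile(r"^.*Exception")

with_lb = greedy_with_lookbehind.search(line)
plain = greedy_plain.search(line)

print(with_lb.group(0))                # Caused by: java.nio.file.NoSuchFileException
print(with_lb.span() == plain.span())  # True
```

Both patterns return the same match here, which suggests that a lookbehind followed only by . and a quantifier is not constraining anything in this position.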
In Summary, as a General Rule
Always anchor lookarounds when using greedy or reluctant quantifiers.