This is a bit peculiar, but the application I'm using implements only a limited subset of regular expression syntax. Most significantly, it doesn't support ANY group options other than basic parenthesis grouping (i.e. no named groups, no look-aheads or look-behinds, not even (?:)).

I'm only looking to get a TRUE/FALSE match. I don't need to extract, replace or parse the data; I just need to know whether the specified pattern occurs in the subject string, yes or no.

So I'm trying to build a pattern using basic regex that will trigger a match upon finding 3 of 4 provided terms, in any order, within a block of text.

To wit, I tried this: `(\b(Term1|Term2|Term3|Term4)\b.*?){3,}` but it doesn't work. Weirdly, if I change `{3,}` to `{1,}` it finds all instances of every term, indicating that the pattern DOES work, but when I tell it I only want a match if there are 3 or more instances, it doesn't find any of them. This remains true even when I try the pattern on Regex101, so it doesn't seem to be a failure of the limited engine within my app.

All of the terms in the subject text are preceded and followed by at least one space or a period, and they do report matches when the pattern `\b(TermX)\b` is used. In one piece of sample data, 9 matches are found when the quantifier is `{1,}`, but ZERO are found when it is changed to `{3,}`!

What am I missing / misunderstanding about this pattern?

Edit: Going back to the limited feature-set that I'm dealing with: I am only able to specify the pattern. As far as I can tell, there is no provided mechanism for specifying the regex options (i.e. Case Insensitive, Global, MultiLine, SingleLine etc.), and it appears that none of them are set by default.
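To make the symptom easier to reproduce, here is a minimal sketch using Python's `re` module. That choice is an assumption: the engine inside the app is unknown and may behave differently. The three-line `subject` string is hypothetical, and the terms are the same placeholders used above. With no options set, `.` does not match a newline, so every repetition of the group has to be found on a single line.

```python
import re

# Hypothetical subject text: one placeholder term per line.
subject = "alpha Term1 beta\ngamma Term2 delta\nepsilon Term3 zeta"

three_or_more = r"(\b(Term1|Term2|Term3|Term4)\b.*?){3,}"
one_or_more = r"(\b(Term1|Term2|Term3|Term4)\b.*?){1,}"

# No flags are passed, mirroring an engine where none are set by default.
# "." stops at "\n", so the lazy ".*?" cannot bridge from one line to the next.
matches_three = re.search(three_or_more, subject) is not None
matches_one = re.search(one_or_more, subject) is not None

# The same text with newlines replaced by spaces puts all terms on one line.
matches_three_one_line = re.search(three_or_more, subject.replace("\n", " ")) is not None

print(matches_three, matches_one, matches_three_one_line)
```

If the app's engine treats `.` the same way, this would explain why `{1,}` reports every term while `{3,}` reports nothing on multi-line input.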

Edit: Sample data, by request...

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<body>
<p><span style="text-decoration: underline;"><strong>EXPIRATION
NOTIFICATION&nbsp;FOR</strong></span><br>somedomain.com</p>
<p>Your domain service account is pending cancellation.</p>
<p></p>
<p>Notice#: xxx-xxx<br>Date: 04.24.2019<br>EXPIRATION DATE:
05.02.2019</p>
<p></p>
<p><b>Follow up on:<br></b><b><span style="text-decoration:
underline;"><span style="color: #ff6600;"><a href="https://spamurl.com"><span style="color: #ff6600; text-decoration: underline;">Secure Online
Payment</span></a></span></span><br></b><b>to complete.</b></p>
<p></p>
<p>Domain: somedomain.com<br>Registration Period: 1 Year/s<br>Amount:
$86.00 USD<br>Status: Pending (Unpaid)</p>
<p></p>
<p></p>
<p>Dear Name Lastname|CompanyName, ,<br>We are reaching out
to let you know that your notice #xxx-xxx&nbsp;is 5 days overdue.<br>We
are keeping your service for somedomain.com online, as your are still within
our grace period, and we want ensure the best possible service for
you.</p>
<p>Your account is in danger of being suspended if we do not receive your
payment soon. Please pay your notice here to avoid service
disruption.&nbsp;</p>
<p></p>
<p><b>Follow up on:<br></b><b><a href="https://spamurl.com"><span style="text-decoration: underline;"><span style="color: #ff6600;"><span style="color: #ff6600; text-decoration: underline;">Secure Online
Payment</span></span></span><br></a></b><b>to complete.</b></p>
<p></p>
<p><span style="font-size: xx-small; color: #c0c0c0;">Instructions and
Unlike Instructions from this Newsletter:</span><br><span style="font-size: xx-small; color: #c0c0c0;">This Email contains
information intended only for the individuals or entities to which it is
addressed. If you are not the intended recipient or the agent responsible
for delivering it to the intended recipient, or have received this Email in
error, please notify immediately the sender of this Email at the Help
Center and then completely delete it. Any other action taken in reliance
upon this Email is strictly prohibited, including but not limited to
unauthorized copying, printing, disclosure, or distribution. We do not
directly register or renew domain names. This is not a bill or an invoice.
This is a optimization offer for your website. You are under no obligation
to pay the amount stated unless you accept this purchase offer. Promotional
material is strictly along the guidelines oft he can-spam act of 2003. They
are in no way misleading. You have received this message because you
elected to receive notification offers. Thank you for your
cooperation.&nbsp;Unsubscribe Domain Service renew <span style="text-decoration: underline;"><a href="https://spamurl.com"><span style="color: #c0c0c0; text-decoration:
underline;">here</span></a></span>.</span></p>
<img src="https://spamurl.com" height="1" width="10"></body>
</html>
NetXpert
  • 1
    Can you post a regex101 link, or an example text that isn't matching as it should? It seems to work here: https://regex101.com/r/AbcfXk/1 – CertainPerformance Apr 24 '19 at 23:09
  • Here's a link to a regex101 page with sample data and the pattern. The application for this is to provide a filter to an email preprocessing application. We're getting hammered by emails like these daily coming from random foreign domains, and random IP addresses, so we're down to trying to implement a regex filter to eliminate them in the DMZ. It's the Email preprocessor that has the restricted Regex engine. https://regex101.com/r/UGlBrg/2 – NetXpert Apr 24 '19 at 23:21
  • 2
    Possible duplicate of [How do I match any character across multiple lines in a regular expression?](https://stackoverflow.com/questions/159118/how-do-i-match-any-character-across-multiple-lines-in-a-regular-expression) https://regex101.com/r/UGlBrg/3 – CertainPerformance Apr 24 '19 at 23:26
  • @CertainPerformance -- thanks for that. On Regex101 it appears to work but, frustratingly, I can't set the Regex options; I can only supply the pattern itself, *and* the engine doesn't appear to recognize `[\s]` or `[\S]` – NetXpert Apr 24 '19 at 23:42
  • Put your sample data and the pattern here, in the question. While regex101 is a useful site, we should not be forced to go there to get the information for your post. Also, the site guidelines require all of the relevant information to be here in the question itself, so that it is preserved for future readers; content in off-site locations can end up deleted or moved or the site can go away, meaning the content there is no longer available and your question has no meaning. – Ken White Apr 24 '19 at 23:46
  • 1
    Then use the method in the second answer there – CertainPerformance Apr 24 '19 at 23:50
  • @CertainPerformance - already tried it; sadly, it didn't help. Now it just *always* matches, no matter how many instances are requested (there are only 9 matches in the sample, but it will match if 10 are requested, and then crashes the engine at 11+) https://regex101.com/r/UGlBrg/4 `(\b(Name|notice|cancellation|domain)(.|[\r\n])*?){10,}` – NetXpert Apr 25 '19 at 00:00
  • Okay, I'm giving up on this; with all the constraints, it just seems doomed to fail. I put in a basic pattern to try to match the particular url patterns they're using, but am unhappy about the higher possibility of false-positives, and/or the relative ease with which it can be bypassed by small changes. I just can't give this problem a whole afternoon of time. Thanks for all the help / insights @CertainPerformance, I marked up your suggestions as being the most helpful. – NetXpert Apr 25 '19 at 00:31
  • Still, yeah, a pure regex doesn't seem like the right or an efficient tool for this. Better to find all runs of word characters, and then check how many of those words match the ones you're looking for, in whatever programming language – CertainPerformance Apr 25 '19 at 00:35
  • Uhm, when it's enclosed in the square brackets, "." goes from "match all characters" to "match a literal period (.) character", which doesn't work at all...? Unfortunately, I don't have access to the underlying code. I'm stuck just giving the application patterns to match in incoming emails and which direct it to reject/discard messages whenever the provided pattern(s) match. It's both a powerful feature, and a frustrating one at the same time. – NetXpert Apr 25 '19 at 00:35
  • Oops, of course, you're right. Maybe the engine doesn't support `\s`, but what about `\w`, `\d`? Either of those in the character set with their opposite should do the trick, eg `[\w\W]` to match anything, including newlines. I wonder if you could exploit ranges somehow, eg https://regex101.com/r/UGlBrg/6 – CertainPerformance Apr 25 '19 at 00:38
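The two suggestions in the comments above (a `[\w\W]` class in place of `.`, and counting words outside the regex) can be sketched as follows. This uses Python's `re` module as a stand-in, which is an assumption; whether the app's engine accepts `\w` and `\W` inside a character class is not known. The helper `distinct_terms` and the two sample strings are hypothetical and exist only for illustration.

```python
import re

# Terms taken from the pattern quoted in the comments above.
TERMS = ("Name", "notice", "cancellation", "domain")
ALTERNATION = "|".join(TERMS)

# [\w\W] matches any character, newlines included, without needing an option flag.
pattern = re.compile(r"(\b(" + ALTERNATION + r")\b[\w\W]*?){3,}")

def distinct_terms(text, terms=TERMS):
    """Count how many different terms occur as whole words in text (case-sensitive)."""
    words = set(re.findall(r"\w+", text))
    return sum(1 for term in terms if term in words)

# Hypothetical inputs: three different terms, and one term repeated three times.
sample = "Your domain service\nis pending cancellation.\nPlease pay your notice."
repeated = "domain\ndomain\ndomain"

print(pattern.search(sample) is not None, distinct_terms(sample))
print(pattern.search(repeated) is not None, distinct_terms(repeated))
```

Note that the repetition counts instances, so `repeated` satisfies `{3,}` even though it contains only one of the four terms. If three different terms are what matters, the word-counting approach expresses that directly, but it needs code outside the pattern, which the app does not appear to allow.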

0 Answers