0

In my PCRE regular expression I used an atomic group to reduce backtracks.

<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b

REGEX101

But in the example debug, I see that after f checking ends, it goes further for other options. I'm trying to stop it after f check fails so it doesn't check the rest of expression. What's wrong?

HBasiri
  • 343
  • 1
  • 6
  • 16
  • 2
    Wow, that is one long pattern for such a short string. Can you provide more examples of strings you want to match? – Paolo Aug 25 '18 at 19:55
  • @UnbearableLightness: I'm trying to match any HTML tags in a string. Match steps are fine but it's so many steps for checking other tags(etc XML) which aren't HTML tags(GET) – HBasiri Aug 26 '18 at 16:09
  • This is because regex performance is important for me – HBasiri Aug 26 '18 at 16:14

2 Answers2

2

I will assume you know what you're doing by using regex here, since there's probably an argument to be made that PCRE is not the best approach to implementing this sort of matching in a "tree"-like fashion. But I'm not fussed about that.

The idea of using conditionals isn't bad, but it adds extra steps in the form of the conditions themselves. Also, you can only branch off in two directions per conditional.

PCRE has a feature called "backtracking control verbs" which allow you to do precisely what you want. They have varying levels of control, and the one I would suggest in this case is the strongest:

<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b

https://regex101.com/r/p572K8/2

Just by adding a single (*COMMIT) verb after the 'f' branch, it's cut the number of steps required to find a failure in this case by half.

(*COMMIT) tells the engine to commit to the match at that point. It won't even re-attempt the match starting from </ again if no match is found.

To fully optimize the expression, you'll have to add (*COMMIT) at every point after branching has occurred.

Another thing you can do is try to re-order your alternatives in such a way as to prioritize those that are found most commonly. That might be something else to consider in your optimization process.

jaytea
  • 1,861
  • 1
  • 14
  • 19
0

Because that's how atomic group works. The idea is:

at the current position, find the first sequence that matches the pattern inside atomic grouping and hold on to it. (Source: Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?)

So if there is no match inside an atomic group, it will iterate through all options. You can use conditionals instead:

</?\s*\b(?(?=a)a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|(?(?=b)b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|(?(?=c)c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|(?(?=d)d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|(?(?=e)em(?:bed)?|(?(?=f)f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|(?(?=h)h(?:[1-6r]|ead(?:er)?|tml)|(?(?=i)i(?:frame|mg|nput|ns)?|(?(?=k)kbd|(?(?=l)l(?:abel|egend|i(?:nk)?)|(?(?=m)m(?:a(?:in|p|rk)|et(?:a|er))|(?(?=n)n(?:av|o(?:frames|script))|(?(?=o)o(?:bject|l|pt(?:group|ion)|utput)|(?(?=p)p(?:aram|icture|re|rogress)?|(?(?=q)q|(?(?=r)r[pt]|(?(?=r)ruby|(?(?=s)s|(?(?=s)s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|(?(?=t)t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|(?(?=u)ul?|(?(?=v)v(?:ar|ideo)|wbr))))))))))))))))))))))\b

Regex101

CrafterKolyan
  • 1,042
  • 5
  • 13