Increase regex performance when matching only if not in html attribute

Question

TYPO3 blocks the rendering of pages with more than 13 content elements when using this TypoScript:

brandReplacing {
    stdWrap {
        replacement {
            10 {
                search = ®(?=[<]*(?:<[^>]*>[^<]*)*$)
                replace = <sup>®</sup>
                useRegExp = 1
            }
        }
    }
}

because the regex needs 118 steps even for this short example, and the number of steps grows exponentially (about 83000 steps for two more attributes). The regex produces the correct result, but it is far too expensive.

Does anybody have an idea how to reduce the number of steps needed to execute the regex, and maybe also how to exclude ® symbols that are already wrapped in <sup> tags? Or is there a better way to solve this problem on the TYPO3 side?

The regex from above:

®(?=[<]*(?:<[^>]*>[^<]*)*$)

The html code:

<img title="Copyright replacement incorrect ®" src="/fileadmin/filexyz.png">
<h1>Copyright replacement correct: ®</h1>
Also correct replacement here: ®
Maybe NOT here: <sup>®</sup>

Try with a possessive quantifier: `®(?=[^<>]*+(?:<[^>]*>[^<]*)*$)` — Wiktor Stribiżew, Sep 14 '16 at 09:22
Wow! Awesome, this was like 75% performance increasing. Do you also have an idea how to exclude already wrapped `®`-Symbols? Otherwise i will accept that answer ;) — Y.Hermes, Sep 14 '16 at 09:27
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Jost, Sep 14 '16 at 09:43
As far as I tested that with "real" data, the steps just reduced from 275809 to 273256... — Y.Hermes, Sep 14 '16 at 09:45
@Jost, I am not parsing the html, this is a TYPO3 function that (i think) just parses each content element's content. — Y.Hermes, Sep 14 '16 at 09:49
I would like to but can't because of privacy reasons... But i'm trying to find something similar. — Y.Hermes, Sep 14 '16 at 09:50
A dummy similar input with same amount of characters is enough. — revo, Sep 14 '16 at 09:51
Just did it now with the content of this page :D Old Regex: https://regex101.com/r/jC0aE8/2 New Regex: https://regex101.com/r/jC0aE8/1 On this page the old is just pretty faster?! — Y.Hermes, Sep 14 '16 at 09:53
I just looked it up what the real problem was: The page where the rendering was blocked contained 12 tables with several rows and columns. As far as i think the rendering broke down because he couldn't handle that much steps (tried it in regex101 and also there it broke down) — Y.Hermes, Sep 14 '16 at 10:00
Do not measure the performance with the number of steps at regex101. The real performance can only be tested in the target environment. Try a PCRE `(?:<sup>®<\/sup>|<[^>]*>)(*SKIP)(*F)|®` regex in TYPO3. — Wiktor Stribiżew, Sep 14 '16 at 10:05
@Y.Hermes: You *are* trying to parse HTML with a regex: Your task is to find (R) in text nodes, but only if it's not the sole content of a `<sup>`-tag. That's clearly out of scope for regexes and the source of your problems. A correct way to do this would be to filter your HTML through a PHP script that uses [`DOMDocument`](http://php.net/manual/en/refs.xml.php) and related functionality to parse the HTML, walk all text nodes (except those that are children of `sup` nodes), and replace the (R) within those text nodes. Then serialize the result to HTML and return it/print it. — Jost, Sep 14 '16 at 10:14
@Jost This is far of my level of skill :D This, of course, sounds like a good idea but im not experienced enough to hook myself into the rendering of TYPO3. In this case (tested it) Wiktor's idea works as it should and is pretty much just simpler! ;) As far as it just works, the answer I was looking for seems to be Wiktor's one :p — Y.Hermes, Sep 14 '16 at 10:19
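
To illustrate why the `(*SKIP)(*F)` suggestion from the comments avoids the backtracking problem, here is a minimal Python sketch of the same idea. Python's standard `re` module has no `(*SKIP)(*F)` verbs, so this is an emulation, not the PCRE pattern itself: the first two alternatives consume what must stay untouched (an already wrapped ® and any tag, attributes included), and only the last alternative captures a bare ®. The names `PATTERN` and `wrap_registered` are made up for this example, and the callback approach is an assumption that would not translate directly to a TypoScript `replacement` block.

```python
import re

# Order matters: the first two branches swallow text that must be left alone,
# so the engine never has to look ahead to the end of the document.
PATTERN = re.compile(r"<sup>®</sup>|<[^>]*>|(®)")


def wrap_registered(html: str) -> str:
    """Wrap each bare ® in <sup>, skipping tags and already wrapped symbols."""
    def repl(match: re.Match) -> str:
        # Group 1 only participates when a bare ® outside any tag was found.
        if match.group(1):
            return "<sup>®</sup>"
        # Otherwise return the skipped tag or wrapped symbol unchanged.
        return match.group(0)

    return PATTERN.sub(repl, html)


sample = (
    '<img title="Copyright replacement incorrect ®" src="/fileadmin/filexyz.png">\n'
    "<h1>Copyright replacement correct: ®</h1>\n"
    "Also correct replacement here: ®\n"
    "Maybe NOT here: <sup>®</sup>"
)

print(wrap_registered(sample))
```

Each character is consumed by exactly one alternative, so the work grows roughly linearly with the input length instead of re-scanning the rest of the page for every ®.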

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

I didn't analyse your regex to find out exactly what is going on, but backtracking is obviously eating up resources. There are other ways to terminate a pattern and keep it from scanning further: following the rules of the markup.

There should be some rules that define attributes and their values. I came up with two rules, to which you may add more later (they describe what can follow a ® that sits inside an attribute value):

The current attribute may be immediately followed by another attr="value". Matching continues until the engine sees a > without jumping over any < or >: [^"<>]*"\s\s*[\w-]+="[^<>]*>
Or it may be the last attribute, whose closing quote is followed by the closing bracket >: [^"<>]*"\s*\/?>

Regex:

(®)(?!([^"<>]*"(\s\s*[\w-]+="[^<>]*>|\s*\/?>)))

Live demo

If you compare it to both of your regular expressions, you will notice how fast it is. The captured ®s are at exactly the same positions as before. It hardly ever matches a wrong position.

Failing cases:

It doesn't match a ®:

If it is immediately followed by " string="...> within literal text, e.g.:

<div>® character" some-chars-here="...."  /></div>

If it is immediately followed by " /> or " > within literal text, e.g.:

<div>®   "        /></div>

I think this rarely happens in practice.
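
The behaviour described above can be checked with a small Python sketch. It uses Python's `re` module instead of PCRE, so it says nothing about timing inside TYPO3; `ANSWER_RE` and `count_matches` are names invented for this example.

```python
import re

# The regex from this answer, unchanged. The negative lookahead rejects a ®
# that is followed by the rest of an attribute value and the end of the tag.
ANSWER_RE = re.compile(r'(®)(?!([^"<>]*"(\s\s*[\w-]+="[^<>]*>|\s*\/?>)))')


def count_matches(html: str) -> int:
    """Return how many ® symbols the answer's regex would replace."""
    return sum(1 for _ in ANSWER_RE.finditer(html))


# Inside an attribute value: rejected, as intended.
in_attribute = '<img title="Copyright replacement incorrect ®" src="/fileadmin/filexyz.png">'
# Plain text node: matched.
in_text = "<h1>Copyright replacement correct: ®</h1>"
# The two failing cases: literal text that merely looks like attribute syntax.
failing_one = '<div>® character" some-chars-here="...."  /></div>'
failing_two = '<div>®   "        /></div>'
# Already wrapped symbol: this regex does not exclude it.
wrapped = "Maybe NOT here: <sup>®</sup>"

for label, html in [
    ("attribute", in_attribute),
    ("text", in_text),
    ("failing case 1", failing_one),
    ("failing case 2", failing_two),
    ("already wrapped", wrapped),
]:
    print(label, count_matches(html))
```

Note the last case: a ® that is already inside <sup> tags is still matched, so this regex alone would wrap it a second time.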
