Efficiently remove substring that may not contain a specific word

Question

I'm trying to map some awful, invalid HTML code to an XML structure with PHP, which I need later on. This works quite well, but there is always some part that I just can't handle. So the decision is to just remove that code so that the XML stays valid. This is what it might look like:

<body>
    <item>abc</item>
    <item>def</item>
    unparsable rest
</body>

So the goal is to find a solution (probably regex, but I'm open to any solution) to just remove the "unparsable rest".

I tried using preg_replace with this regex

/<\/item>(((?!item).)*)\s*<\/body>/iU

And it worked pretty well, matching exactly the part I wanted to have in $1, all the stuff between the last </item> and </body>. But as the XMLs are quite large, the calculation just crashes after a couple of milliseconds. I know that regexes are not so good at negative lookaheads, but I didn't think it was that bad.

So there needs to be a more efficient solution. Unfortunately I can't use strrpos as there are many more tags after the

Over-simplifying, you can check for `.*?<` Basically anything malformed between the closing tag of an element and the next (valid) opening tag. Though this kind of interrogation should be done via a parser not brute-forcing it with patterns. (Not saying you're married to patterns, but a parser would be a better bet). — Brad Christie, Sep 13 '12 at 15:23
but a parser might not parse invalid stuff, that's exactly my problem :) — Ria Weyprecht, Sep 13 '12 at 15:24
and `.*?<` will match the first closing item, not the last one as it doesn't know, that there shouldn't be another item in the text i want to have — Ria Weyprecht, Sep 13 '12 at 15:25
I'm not saying anything out-of-box, I'm saying you're better off writing a parser. You mentioned this is the result of a conversion previously. The previous conversion sounds like it needs to be worked on, not the band aid for where it failed. — Brad Christie, Sep 13 '12 at 15:25
Well if you tell me a HTML-Parser that can deal with things like `
text
` (and this is one of the nice mistakes) I would take it :) But loadHtml() has no chance at all. I am actually writing a lot of parsing rules that ignore mistakes, but at some point it would just take 20 hours for one page. So the decision is to just remove the rest. — Ria Weyprecht, Sep 13 '12 at 15:30
Have you tried the ones found on [this question](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) to see which gets you closest? (This is assuming the refactoring the original PHP code that's causing the malformed code isn't an option). — Brad Christie, Sep 13 '12 at 15:34
the malformed html is an export of a website being edited by people with no idea what they are doing that should be moved to another cms. so no, i can't change anything on the html or the things that are producing it :) I'll take a look at all those different parsers. But still, it's not a solution for Stuff it just can't handle (what my actual question was) — Ria Weyprecht, Sep 13 '12 at 15:39
Please check my answer, it is more of academic interest, I suppose. — Wiktor Stribiżew, Feb 04 '20 at 14:10
The problem is over 8 years old... The project is finished since pretty much 7.5 years. I didn't need it anymore and am not even working in the same company anymore to check it. — Ria Weyprecht, Jan 08 '21 at 21:29

score 1 · Answer 1 · answered Jan 23 '20 at 08:54

You have a tempered greedy token in your regex pattern. It is slow by its nature, see the "Performance Issue" section in the answer I link to.

So, your current regex, which I prefer to write without the U modifier and with the s modifier as ~</item>(((?!item).)*?)\s*?</body>~is, matches your input string within 231 steps.

Note there is not much semantic difference between \s* and \s*? here since there is no other quantified pattern before </body>. \s*, the greedy pattern, is preferred in such cases.

Let's unroll the pattern and replace ((?!item).)*? with [^i]*(?:i(?!tem)[^i]*)*. The ~</item>([^i]*(?:i(?!tem)[^i]*)*)\s*</body>~is matches your input within 117 steps.

This is still quite a lot for the string. The whitespace after </item> can be matched possessively with \s*+ to prevent backtracking into that part of the string. The ~</item>\s*+([^i]*(?:i(?!tem)[^i]*)*)\s*</body>~is pattern shows an improvement: it now takes 89 steps to match the string, and only unparsable rest lands in the Group 1 value.

Unfortunately, we cannot play with backtracking much here since you want to cut off trailing whitespace from Group 1 value.
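To check that the rewrites above still select the same text, the three patterns can be run against the sample input from the question. This is only a rough sketch: the array keys are labels made up for illustration, and the captures are trimmed before comparison because the raw Group 1 values may differ in surrounding whitespace.

```php
<?php
// Sample input from the question.
$xml = "<body>\n    <item>abc</item>\n    <item>def</item>\n    unparsable rest\n</body>";

// The keys are illustrative labels, not terms from any library.
$patterns = [
    'tempered'   => '~</item>(((?!item).)*?)\s*</body>~is',
    'unrolled'   => '~</item>([^i]*(?:i(?!tem)[^i]*)*)\s*</body>~is',
    'possessive' => '~</item>\s*+([^i]*(?:i(?!tem)[^i]*)*)\s*</body>~is',
];

$captures = [];
foreach ($patterns as $name => $pattern) {
    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $xml, $m) === 1) {
        // Trim so that whitespace differences do not matter.
        $captures[$name] = trim($m[1]);
    } else {
        $captures[$name] = null;
    }
}

var_dump($captures);
```

The step counts quoted above come from a regex debugger and are not reproduced by this script; it only compares what each pattern captures.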

If you want to match everything between </item> and </body> that does not contain <item> inside, the pattern will look like ~</item>\s*+([^<]*(?:<(?!item>)[^<]*)*)\s*</body>~is, see the regex demo.
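A minimal sketch of how that last pattern might be used with preg_replace to drop the leftover text. The function name removeUnparsableRest and the replacement string are assumptions for illustration. In particular, the replacement writes lowercase tags with a single newline between them, which may not suit every document.

```php
<?php
// Hypothetical helper: removes whatever sits between the last </item>
// and </body>, as long as that stretch contains no <item> tag.
function removeUnparsableRest(string $xml): string
{
    $pattern = '~</item>\s*+([^<]*(?:<(?!item>)[^<]*)*)\s*</body>~is';

    // Assumption: rebuilding the two closing tags in lowercase is acceptable.
    $result = preg_replace($pattern, "</item>\n</body>", $xml);

    // preg_replace returns null on failure, e.g. when the backtrack limit
    // is hit on very large inputs.
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    return $result;
}

$xml = "<body>\n    <item>abc</item>\n    <item>def</item>\n    unparsable rest\n</body>";
$cleaned = removeUnparsableRest($xml);
echo $cleaned;
```

Checking for null matters here because a silent failure would otherwise replace the whole document with nothing.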

score 0 · Answer 2 · answered Jan 23 '20 at 10:14

Check that each line starts with '<' and ends with '>':

$t ='<body>
    <item>abc</item>
    <item>def</item>
    unparsable rest
</body>';

// break the string into lines
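$filtered = array_filter(explode("\n", $t), function($line) {
    // each line
    $line = trim($line); // ignore surrounding whitespace
    // guard against empty lines: $line[0] on an empty string raises a warning
    return $line !== '' && $line[0] === '<' && substr($line, -1) === '>';
});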
// rebuild the string
$result = implode("\n", $filtered);
echo $result;

Demo: https://3v4l.org/Mt5eG
