I am trying to map some awful, invalid HTML code to an XML structure with PHP; I need the XML later on. This works quite well, but there is always some part that I just can't handle. So the decision is to just remove that code so that the XML stays valid. This is what it might look like:
<body>
<item>abc</item>
<item>def</item>
unparsable rest
</body>
So the goal is to find a solution (probably a regex, but I'm open to anything) that just removes the "unparsable rest".
I tried using preg_replace with this regex:
/<\/item>(((?!item).)*)\s*<\/body>/iU
And it worked pretty well, matching exactly the part I wanted in $1: all the stuff between the last </item> and </body>. But as the XML files are quite large, the calculation just crashes after a couple of milliseconds. I know that regexes are not so good at negative-lookahead stuff, but I didn't think it was that bad.
So there needs to be a more efficient solution. Unfortunately I can't use strrpos, as there are many more tags after the
text
` (and this is one of the nice mistakes) I would take it :) But loadHtml() has no chance at all. I am actually writing a lot of parsing rules that ignore mistakes, but at some point it would just take 20 hours for one page. So the decision is to just remove the rest. – Ria Weyprecht Sep 13 '12 at 15:30
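One possible direction for the regex in the question: the `((?!item).)*` part runs a lookahead at every single character inside a repeated capturing group, which on large inputs is a likely way to hit PCRE's backtracking or recursion limits. The sketch below is only an illustration under assumptions, not a tested fix for the real files: `stripUnparsableRest()` is a made-up helper name, and the pattern only treats `<item`, `</item` and `</body>` tags as boundaries (the original also stopped at the bare word "item"). It consumes text in possessive runs, so the engine has nothing to backtrack into.

```php
<?php
declare(strict_types=1);

/**
 * Removes whatever sits between the last </item> and the closing </body>.
 * stripUnparsableRest() is a hypothetical helper name, not from the question.
 */
function stripUnparsableRest(string $xml): string
{
    // </item>\K             match the tag, then drop it from the reported match
    // [^<]++                runs of plain text, possessive (never given back)
    // <(?!/?item\b|/body>)  a "<" that does not start <item, </item or </body>
    // (?=</body>)           the leftover must end directly at </body>
    $pattern = '~</item>\K(?>[^<]++|<(?!/?item\b|/body>))*+(?=</body>)~i';

    $result = preg_replace($pattern, "\n", $xml);

    // preg_replace() returns null on error (e.g. a PCRE limit was reached);
    // fall back to the untouched input in that case.
    return $result ?? $xml;
}

$input = "<body>\n<item>abc</item>\n<item>def</item>\nunparsable rest\n</body>";
echo stripUnparsableRest($input), "\n";
```

For the sample from the question this leaves the two `<item>` elements followed directly by `</body>`. If an unclosed `<item` tag shows up inside the unparsable part, the pattern does not match there and the text is left alone, so that case would still need its own rule.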