preg_replace returning null when input is html (but not all of the time)

Question

I am reading in html from a few different sources which I have to manipulate. As part of this I have a number of preg_replace() calls where I have to replace some of the information within the html received.

On 90% of the sites I have to do this on, everything works fine; the remaining 10% return NULL on each of the preg_replace() calls.

I've tried increasing the pcre.backtrack_limit and pcre.recursion_limit based on other articles I've found which appear to have the same problem, but this has been to no avail.

I have output preg_last_error(), which returns 4, and the PHP documentation isn't proving very helpful about what that means. If anyone can shed any light on this it might start to point me in the right direction, but I'm stumped.

One of the offending examples is:

$html = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);

but as I said, this works 90% of the time.

I don't know how `*?` is interpreted, but it seems redundant (equivalent to `*`, no?). — pascal, Jan 28 '11 at 16:22
@pascal this makes the `*` quantifier ungreedy. (http://php.net/manual/en/regexp.reference.repetition.php) — Arnaud Le Blanc, Jan 28 '11 at 16:23

score 2 · Accepted Answer · answered Jan 28 '11 at 16:18

Don't parse HTML with regex. Use a real DOM parser:

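$dom = new DOMDocument;
$dom->loadHTML($html);
// getElementsByTagName() returns a live list: removing a node shrinks it,
// so keep removing item(0) until the list is empty.
$scripts = $dom->getElementsByTagName('script');
while ($el = $scripts->item(0)) {
    $el->parentNode->removeChild($el);
}
$html = $dom->saveHTML();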
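
Scraped pages are often not well-formed, and loadHTML() reports every parse problem as a PHP warning. A minimal sketch of the same removal with those warnings collected through libxml instead of emitted; the sample markup is made up for illustration, and whether to inspect or simply discard the collected errors is a design choice left open here:

```php
<?php
// Hypothetical sample input with an unclosed <p>, as scraped pages often have.
$html = '<html><body><p>Hello<script>alert(1);</script><p>unclosed</body></html>';

// Collect libxml parse errors internally instead of raising PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Discard the collected errors; libxml_get_errors() could be called first to log them.
libxml_clear_errors();

// Remove every <script> element from the (live) node list.
$scripts = $dom->getElementsByTagName('script');
while ($el = $scripts->item(0)) {
    $el->parentNode->removeChild($el);
}

$result = $dom->saveHTML();
```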

answered Jan 28 '11 at 16:18

lonesomeday

OK, I'm looking at using DOMDocument and have a basic version working. If I want to add an element before other children, do I need to loop through them all, as appendChild() just adds on to the end? – Simon Jan 28 '11 at 18:59
@Simon Not quite sure what you mean by that. I think you may be looking for [`DOMNode::insertBefore`](http://php.net/manual/en/domnode.insertbefore.php). – lonesomeday Jan 28 '11 at 21:16
Yep found it. Pretty much got everything working as before but with the additional sites as well. Thanks for your help everyone. – Simon Jan 28 '11 at 22:32
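
A minimal sketch of the DOMNode::insertBefore approach mentioned in the comments, placing a new element ahead of the existing children without looping over them. The markup, the `box` id, and the variable names are made up for illustration:

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<html><body><div id="box"><p>first</p><p>second</p></div></body></html>');

// The container whose children we want to add to.
$box = $dom->getElementsByTagName('div')->item(0);

// Create the new element in the same document.
$note = $dom->createElement('p', 'inserted');

// insertBefore(new node, reference node): $note ends up ahead of the current first child.
$box->insertBefore($note, $box->firstChild);

// Serialize only the container to inspect the result.
echo $dom->saveHTML($box), "\n";
```

Passing any other child as the second argument inserts the new node in front of that child instead.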

score 0 · Answer 2 · edited May 23 '17 at 09:58

You have bad UTF-8: error code 4 from preg_last_error() is PREG_BAD_UTF8_ERROR.

/**
 * Returned by preg_last_error if the last error was
 * caused by malformed UTF-8 data (only when running a regex in UTF-8 mode). Available
 * since PHP 5.2.0.
 * @link http://php.net/manual/en/pcre.constants.php
 */
define ('PREG_BAD_UTF8_ERROR', 4);

However, you really should not use regex to parse HTML. Use DOMDocument.

EDIT: Also, I don't think this answer would be complete without including "You can't parse [X]HTML with regex."
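
A minimal sketch of how the failure can be reproduced, assuming the fetched page contains a byte sequence that is not valid UTF-8. The sample string is made up: it holds "café" encoded as ISO-8859-1, so the byte 0xE9 is malformed when read as UTF-8. The pattern is the one from the question, run once with the `u` modifier and once without:

```php
<?php
// Hypothetical sample: "\xE9" is a Latin-1 "é", which is not valid UTF-8 on its own.
$html = "<p>caf\xE9</p><script>alert(1);</script>";

// With the u modifier the subject must be valid UTF-8, so the call fails.
$withU = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);
$errorWithU = preg_last_error();

// Without the u modifier the subject is matched as raw bytes.
$withoutU = preg_replace('@<script[^>]*?.*?</script>@si', '', $html);
$errorWithoutU = preg_last_error();

var_dump($withU === null);
var_dump($errorWithU === PREG_BAD_UTF8_ERROR);
var_dump($errorWithoutU === PREG_NO_ERROR);
```

This would explain why only some sites fail: the ones that are not served as valid UTF-8.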

score 0 · Answer 3 · answered Jan 28 '11 at 16:21

Your error 4 is PREG_BAD_UTF8_ERROR; you should check the charset used on the sites which cause this error.
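
One way to act on this is to validate the input and convert it before running the regex. A minimal sketch, assuming the mbstring extension is available; `toUtf8` is a hypothetical helper, and the ISO-8859-1 default is only a guess that would normally be replaced by the charset the site declares in its Content-Type header or meta tag:

```php
<?php
/**
 * Hypothetical helper: return $html as valid UTF-8.
 * $assumedSource is an assumption about the page's real encoding.
 */
function toUtf8(string $html, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($html, 'UTF-8', $assumedSource);
}

// Made-up sample: "café" encoded as ISO-8859-1.
$latin1 = "<p>caf\xE9</p><script>alert(1);</script>";

// After conversion the u modifier no longer rejects the subject.
$clean = preg_replace('@<script[^>]*?.*?</script>@siu', '', toUtf8($latin1));
```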

answered Jan 28 '11 at 16:21

soju

score 0 · Answer 4 · answered Jan 28 '11 at 16:22

It is possible that you exceeded backtrack and/or internal recursion limits. See http://php.net/manual/en/pcre.configuration.php

Try this before preg_replace:

ini_set('pcre.backtrack_limit', '10000000');
ini_set('pcre.recursion_limit', '10000000');
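
Raising these limits only helps when preg_last_error() reports PREG_BACKTRACK_LIMIT_ERROR (2) or PREG_RECURSION_LIMIT_ERROR (3). A minimal sketch for checking which error actually occurred; `describePregError` is a hypothetical helper, not part of PHP, and the sample input is made up:

```php
<?php
// Hypothetical helper: translate a preg_last_error() code into its constant name.
function describePregError(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
    ];
    // Codes not listed above fall through to a generic message.
    return $names[$code] ?? "unknown error ($code)";
}

$result = preg_replace('@<script[^>]*?.*?</script>@si', '', '<p>ok</p><script>x</script>');
if ($result === null) {
    // Only reached when the regex engine failed; report why.
    echo 'preg_replace failed: ', describePregError(preg_last_error()), "\n";
}
```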

answered Jan 28 '11 at 16:22

Arnaud Le Blanc
