2

I want a PHP regex that can find PHP error messages on a page, so that when I crawl a site I can list the errors each page displays.

Currently I have the following code:

preg_match('/<b>.+<\/b>:.+ in <b>\/.+<\/b> on line <b>[0-9]+<\/b><br( \/)?>/msi',$html,$errors);

It shows whether errors occurred, but it does not list them individually! I get the full HTML page in the array ($errors[0]).

Could anybody help?

EDIT: So I have a page with for example the following HTML-source, from which I want to extract the PHP errors:

<b>Warning</b>:  session_start() [<a href='function.session-start'>function.session-start</a>]: The session id contains invalid characters, valid characters are only a-z, A-Z and 0-9 in <b>/home/.../public_html/articlescript/init.php</b> on line <b>127</b><br />
<br />
<b>Warning</b>:  session_start() [<a href='function.session-start'>function.session-start</a>]: Cannot send session cache limiter - headers already sent (output started at /home/.../public_html/articlescript/init.php:127) in <b>/home/.../public_html/articlescript/init.php</b> on line <b>127</b><br />
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

<head>
    <title>...
Simon
  • Could you please provide more information about the entire scenario? – Gordon Oct 08 '10 at 16:23
  • I'm not sure what you plan on using this for, but you should be aware that PHP can be (and often is) configured to display errors (when they are even displayed) in different ways. You can't rely on client-side methods to detect server-side errors. – Brad Oct 08 '10 at 16:35
  • In most cases they are displayed this way, and I'm aware that they can be turned off. I just want to check a page CLIENT-SIDE for errors like these. I couldn't find a regex anywhere that works for this case! – Simon Oct 08 '10 at 16:38
  • Why is normal error handling not an option? http://www.php.net/manual/en/intro.errorfunc.php – stevendesu Oct 08 '10 at 16:54
  • Because it's an external site. – Simon Oct 08 '10 at 16:59

5 Answers

5

Since – well, you know – you shouldn’t use regular expressions to parse HTML, try this using PHP’s DOM library:

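// Collect libxml's parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($str);
$messages = array();
foreach ($doc->getElementsByTagName('b') as $elem) {
    // A PHP error message starts with a B element naming the error level.
    if (in_array($elem->textContent, array('Error', 'Warning', 'Notice'))) {
        $buffer = $elem->textContent;
        // Append the text of each following sibling until the closing BR.
        // nodeName is always a string (e.g. "#text"), unlike localName.
        $node = $elem->nextSibling;
        while ($node !== null && strtolower($node->nodeName) !== 'br') {
            $buffer .= $node->textContent;
            $node = $node->nextSibling;
        }
        $messages[] = $buffer;
    }
}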

This searches for B elements whose content is one of “Error”, “Warning”, or “Notice” and collects the text from there up to the next BR element. The initial call to libxml_use_internal_errors(true) keeps libxml's own parsing errors from being reported.

Gumbo
  • This doesn't entirely work. How can I make it behave the same as http://ideone.com/utL3K? – Simon Oct 08 '10 at 16:58
  • @Kevin: Ok, I have to admit that this might fail if the document is actually invalid HTML and is fragmented in such a way that parsing fails. – Gumbo Oct 08 '10 at 17:10
  • No, it just does not list the errors correctly. The while loop doesn't work. If I delete the while loop it lists the errors, but not their texts. – Simon Oct 08 '10 at 17:12
2

Forgive my language but it's quite foolish to attempt to parse HTML with regular expressions, especially potentially-malformed HTML. Use an HTML parsing library instead.

For HTML parsing and validation in PHP, I would refer to this answer; also check out the tidy extension.

Community
Ether
  • Well, in this case the HTML isn't really XML compliant, and moreover you can't really know where this error will show up, so an XML parser (or an HTML one, for what it's worth) won't help. – Colin Hebert Oct 08 '10 at 16:39
  • @Colin: there are HTML parsers that will identify errors, which is precisely what the OP wants to do. HTML is not regular, so using a regular expression will not be fruitful. – Ether Oct 08 '10 at 16:44
  • That comment must be one of the most-linked ones here. – CanSpice Oct 08 '10 at 16:55
  • @Kevin: I've edited my answer with the best links I could find. – Ether Oct 08 '10 at 17:16
1

Remember to escape your \ in strings.

preg_match_all('#<b>(.+?)</b>:(.+?) in <b>(.+?)</b> on line <b>([0-9]+)</b><br(?: /)?>#is',$string,$errors);

This code on ideone
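As a rough sketch of how the pattern above could be used to list each error separately: the sample input below is a shortened version of the HTML from the question, and the $found array and its keys are names chosen here for illustration, not part of the original answer.

```php
<?php
// Shortened sample input based on the HTML in the question.
// The paths are kept elided exactly as in the original post.
$string = "<b>Warning</b>:  session_start(): The session id contains invalid characters in <b>/home/.../init.php</b> on line <b>127</b><br />\n"
        . "<br />\n"
        . "<b>Warning</b>:  session_start(): Cannot send session cache limiter in <b>/home/.../init.php</b> on line <b>127</b><br />\n"
        . "<html><head><title>...</title></head></html>";

$pattern = '#<b>(.+?)</b>:(.+?) in <b>(.+?)</b> on line <b>([0-9]+)</b><br(?: /)?>#is';

// PREG_SET_ORDER yields one array per match (full match plus its groups)
// instead of one array per capture group.
$count = preg_match_all($pattern, $string, $errors, PREG_SET_ORDER);

// Hypothetical result structure: one associative array per error.
$found = array();
foreach ($errors as $e) {
    $found[] = array(
        'type'    => $e[1],
        'message' => trim($e[2]),
        'file'    => $e[3],
        'line'    => (int) $e[4],
    );
}

foreach ($found as $f) {
    printf("%s: %s (%s:%d)\n", $f['type'], $f['message'], $f['file'], $f['line']);
}
```

The lazy quantifiers (.+?) are what keep each match from running on to the last error on the page, which is the problem with the greedy .+ in the question's pattern.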

Colin Hebert
0

Put parentheses () around the parts of the regex that you want to be stored in $errors.
You'll also want to use preg_match_all() rather than preg_match().
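A small illustration of both points, using a made-up two-message string rather than real PHP output: preg_match() stops after the first match, preg_match_all() collects every match, and each parenthesised group gets its own slot in the result. The greedy variant at the end is included only to show why .+ can swallow several messages at once.

```php
<?php
// Made-up input for illustration only.
$html = '<b>Notice</b>: one<br><b>Warning</b>: two<br>';

// Lazy groups: each (.+?) stops at the first place the rest can match.
$pattern = '#<b>(.+?)</b>: (.+?)<br>#';

// preg_match() returns only the first match and its groups.
preg_match($pattern, $html, $first);

// preg_match_all() returns every match; by default $all[1] holds all
// values of group 1, $all[2] all values of group 2, and so on.
$total = preg_match_all($pattern, $html, $all);

// Greedy groups: the first (.+) runs as far as it can, so the whole
// string becomes a single match.
$greedyTotal = preg_match_all('#<b>(.+)</b>: (.+)<br>#', $html, $greedy);

print_r($first);
print_r($all);
print_r($greedy);
```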

chigley
0

If this is your own website, you can either set the log levels and parse your log files (easier) or run your scripts from the command line with php -l, which checks them for syntax errors only.
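For the log-file route, a minimal sketch might look like the following. It assumes the log uses the common "[timestamp] PHP Level:  message in file on line N" layout; the two sample lines, the paths in them, and the variable names are made up for illustration, and a real log may contain entries (stack traces, custom messages) that this pattern does not cover.

```php
<?php
// Made-up sample lines in the assumed error-log layout.
$log = "[08-Oct-2010 16:23:00 UTC] PHP Warning:  Division by zero in /var/www/example/index.php on line 12\n"
     . "[08-Oct-2010 16:23:01 UTC] PHP Notice:  Undefined variable: foo in /var/www/example/index.php on line 20\n";

// Groups: 1 = timestamp, 2 = level, 3 = message, 4 = file, 5 = line.
// The m flag lets ^ and $ match at the start and end of every line.
$pattern = '/^\[([^\]]+)\] PHP ([^:]+):\s+(.+) in (\S+) on line (\d+)$/m';

preg_match_all($pattern, $log, $entries, PREG_SET_ORDER);

foreach ($entries as $entry) {
    printf("%s | %s | %s | %s:%d\n", $entry[1], $entry[2], $entry[3], $entry[4], (int) $entry[5]);
}
```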

supakeen