php get body from html page

Question

I want to extract the body markup from the full HTML of a page.

I use the script below.

<?php       
    function getbody($filename) {
      $file = file_get_contents($filename);

      $bodystartpattern = ".*<body>";
      $bodyendpattern = "</body>.*";

      $noheader = eregi_replace($bodystartpattern, "", $file);

      $noheader = eregi_replace($bodyendpattern, "", $noheader);

      return $noheader;
    }
    $bodycontent = getbody($_GET['url']);
?>

But in some cases the tag is not literally <body>; it could be <body style="margin:0;"> or something similar. How can I write the regular expression in $bodystartpattern so that it matches the whole opening body tag, up to its closing ">"?

Sidenote: [`eregi_replace()`](http://www.php.net//manual/en/function.eregi-replace.php) This function has been DEPRECATED as of PHP 5.3.0. Relying on this feature is highly discouraged. — Funk Forty Niner, Jun 25 '14 at 18:06
Check [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) on using regex to parse html... — Matthew Johnson, Jun 25 '14 at 18:12
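
For readers who still want the regex route the question asks about, here is a minimal sketch using `preg_match`, since the `ereg` family was removed in PHP 7. The function name `getBodyWithRegex` is made up for illustration, and the pattern assumes a single `<body>` element with no `>` inside its attribute values. As the linked answer warns, regexes are fragile on real-world HTML, so treat this as a stopgap rather than a robust parser.

```php
<?php
// Hypothetical helper (not from the original post): return the markup between
// the opening and closing body tags, or null when no body element is found.
function getBodyWithRegex(string $html): ?string
{
    // <body\b[^>]*>  matches the opening tag with any attributes.
    // (.*?)          lazily captures everything up to the first </body>.
    // Flags: i = case-insensitive, s = dot also matches newlines.
    if (preg_match('~<body\b[^>]*>(.*?)</body\s*>~is', $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

$html = '<html><head><title>t</title></head>'
      . '<body style="margin:0;"><p>Hello</p></body></html>';
echo getBodyWithRegex($html), "\n"; // prints <p>Hello</p>
```

The `[^>]*` part is what answers the question directly: it consumes any attributes up to the closing ">" of the opening tag.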

hlscalon · Answer 1 · 2014-07-01T14:19:00.530

2

Why don't you use an HTML parser?

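function getbody($filename) {
  $file = file_get_contents($filename);

  $dom = new DOMDocument();
  // Real-world HTML is rarely valid; buffer libxml warnings instead of printing them.
  libxml_use_internal_errors(true);
  $dom->loadHTML($file);
  libxml_use_internal_errors(false);
  $bodies = $dom->getElementsByTagName('body');
  assert($bodies->length === 1);
  $body = $bodies->item(0);
  // DOMNode exposes childNodes (there is no children property and no remove($node) method).
  // Serializing each child separately yields the body content without the <body> tag itself.
  $stringbody = '';
  foreach ($body->childNodes as $child) {
      $stringbody .= $dom->saveHTML($child);
  }
  return $stringbody;
}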

DOM loadHTML reference
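
To see what `libxml_use_internal_errors(true)` is hiding, the buffered warnings can be read back with `libxml_get_errors()`. The sketch below is only an illustration: the sample markup and variable names are invented, and which messages libxml reports (if any) depends on the libxml2 version installed.

```php
<?php
// Deliberately sloppy sample markup (invented for this example).
$html = '<html><body style="margin:0;"><p>Unclosed paragraph<nosuchtag>x</nosuchtag></body></html>';

// Buffer libxml warnings instead of emitting them as PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Print whatever the parser complained about.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The body element is found even though the opening tag carries attributes.
$body = $dom->getElementsByTagName('body')->item(0);
echo $body->getAttribute('style'), "\n";
```

Restoring the previous setting rather than hard-coding `false` keeps the snippet from changing error handling for the rest of the script.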

edited Jul 01 '14 at 14:19

answered Jun 25 '14 at 18:15

hlscalon


I've copied your code, but now it returns nothing anymore... any idea? – Guido Lemmens 2 Jun 25 '14 at 19:19
@GuidoLemmens2 Does the fetched page contain any PHP code, more specifically any `$`? That can break things. Do you have error reporting on? Do you get any response at all? – hlscalon Jun 25 '14 at 20:03

score 2 · Accepted Answer · answered Jun 26 '14 at 00:06


@1nflktd I have tried the code below.

<?php
    header('Content-Type:text/html; charset=UTF-8');

    function getbody($filename) {
        $file = file_get_contents($filename);       
        $dom = new DOMDocument;
        $dom->loadHTML($file);
        $bodies = $dom->getElementsByTagName('body');
        assert($bodies->length === 1);
        $body = $bodies->item(0);
        // DOMNode has childNodes, not children; concatenate the children of <body>
        // so the result excludes the <body> tag itself.
        $stringbody = '';
        foreach ($body->childNodes as $child) {
            $stringbody .= $dom->saveHTML($child);
        }
        return $stringbody;
    }

    $url = "http://www.barcelona.com/";
    $bodycontent = getbody($url);
?>

<html>
<head></head>
<body>
<?php
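    // Escape the output so the fetched markup is shown as text instead of being rendered.
    echo "BODY ripped from: ".htmlspecialchars($url)."<br/>";
    echo "<textarea rows='40' cols='200' >".htmlspecialchars($bodycontent)."</textarea>";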
?>
</body>
</html>


Guido Lemmens 2


I just tried your code on my machine and it worked fine. Are you getting any errors? If you don't have error reporting enabled, turn it on. – hlscalon Jun 30 '14 at 14:04
It doesn't work on my machine. You can see this script live at http://www.kunstplantenonline.nl/test/test.php, including the PHP warnings. – Guido Lemmens 2 Jun 30 '14 at 22:17
See http://stackoverflow.com/questions/9149180/domdocumentloadhtml-error and check my updated answer. – hlscalon Jul 01 '14 at 11:57

I have changed "$dom = new DOMDocument;" to "$dom = new DOMDocument();" and it works :-) – Guido Lemmens 2 Jul 01 '14 at 13:22
