0

I want to extract the HTML body from the full HTML code of a page.

I use the script below.

<?php       
    function getbody($filename) {
      $file = file_get_contents($filename);

      $bodystartpattern = ".*<body>";
      $bodyendpattern = "</body>.*";

      $noheader = eregi_replace($bodystartpattern, "", $file);

      $noheader = eregi_replace($bodyendpattern, "", $noheader);

      return $noheader;
    }
    $bodycontent = getbody($_GET['url']);
?>

But in some cases the <body> tag doesn't appear literally; it may be <body style="margin:0;"> or something similar. What regular expression can I use in $bodystartpattern so that it matches the whole opening body tag, up to and including its closing ">"?
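
For illustration, a pattern that tolerates attributes on the opening tag could look like the sketch below. It uses `preg_match()` rather than the `eregi_*` functions; the helper name `extract_body_with_regex` and the sample string are made up for this example. It assumes that no attribute value contains a literal ">" and that the document has a single body element, so treat it as a rough approach rather than a robust HTML parser.

```php
<?php
// Hypothetical helper (the name is an assumption): returns the markup between
// the opening <body ...> tag and the closing </body> tag, or null if not found.
function extract_body_with_regex($html) {
    // <body\b     the tag name followed by a word boundary
    // [^>]*>      any attributes, up to the closing ">" of the opening tag
    // (.*?)       the body contents, captured lazily
    // </body\s*>  the closing tag
    // Flags: i = case-insensitive, s = "." also matches newlines
    $pattern = '~<body\b[^>]*>(.*?)</body\s*>~is';
    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

$sample = '<html><head></head><body style="margin:0;"><p>Hello</p></body></html>';
echo extract_body_with_regex($sample), "\n";
```

The `[^>]*` part is what handles the attributes: it consumes everything after the tag name until the first ">".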

Alan Moore
Guido Lemmens 2
  • Sidenote: [`eregi_replace()`](http://www.php.net//manual/en/function.eregi-replace.php) This function has been DEPRECATED as of PHP 5.3.0. Relying on this feature is highly discouraged. – Funk Forty Niner Jun 25 '14 at 18:06
  • Check [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) on using regex to parse html... – Matthew Johnson Jun 25 '14 at 18:12

2 Answers

2

Why don't you use an HTML parser?

function getbody($filename) {
  $file = file_get_contents($filename);

  $dom = new DOMDocument();
  // Buffer libxml warnings about malformed real-world HTML instead of emitting them.
  libxml_use_internal_errors(true);
  $dom->loadHTML($file);
  libxml_use_internal_errors(false);
  $bodies = $dom->getElementsByTagName('body');
  assert($bodies->length === 1);
  $body = $bodies->item(0);
  // Serialize each child node so that only the contents of <body> are returned.
  $stringbody = '';
  foreach ($body->childNodes as $child) {
      $stringbody .= $dom->saveHTML($child);
  }
  return $stringbody;
}

DOM loadHTML reference
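
If you want to see what the parser complained about instead of silently discarding it, the buffered messages can be read back with `libxml_get_errors()`. The following is only a sketch: the helper name `collect_html_parse_errors` and the sample markup are assumptions made for this example, and which messages you get (if any) depends on the installed libxml version.

```php
<?php
// Hypothetical helper (the name is an assumption): parses HTML and returns
// the parser messages instead of letting them surface as PHP warnings.
function collect_html_parse_errors($html) {
    // Buffer parser errors and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError object with line and message fields.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();                  // empty the error buffer
    libxml_use_internal_errors($previous);  // restore the earlier setting
    return $messages;
}

print_r(collect_html_parse_errors('<body><foo>unknown tag</foo></body>'));
```

Restoring the previous setting rather than hard-coding `false` keeps the helper from changing error handling for the rest of the script.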

hlscalon
  • I've copied your code, but now it returns nothing anymore... any idea? – Guido Lemmens 2 Jun 25 '14 at 19:19
  • @GuidoLemmens2 Is there any PHP code inside the fetched page, more specifically a `$`? That can break things. Do you have error reporting on? Do you get any response from it? – hlscalon Jun 25 '14 at 20:03
2

@1nflktd I have tried the code below.

<?php
    header('Content-Type:text/html; charset=UTF-8');

    function getbody($filename) {
        $file = file_get_contents($filename);       
        $dom = new DOMDocument;
        $dom->loadHTML($file);
        $bodies = $dom->getElementsByTagName('body');
        assert($bodies->length === 1);
        $body = $bodies->item(0);
        for ($i = 0; $i < $body->children->length; $i++) {
            $body->remove($body->children->item($i));
        }
        $stringbody = $dom->saveHTML($body);
        return $stringbody;
    }

    $url = "http://www.barcelona.com/";
    $bodycontent = getbody($url);
?>

<html>
<head></head>
<body>
<?php
    echo "BODY ripped from: ".$url."<br/>";
    echo "<textarea rows='40' cols='200' >".$bodycontent."</textarea>";
?>
</body>
</html>
Guido Lemmens 2
  • I just tried your code on my machine and it worked fine. Are you getting any errors? If you don't have error reporting enabled, turn it on. – hlscalon Jun 30 '14 at 14:04
  • It doesn't work on my machine. You can see this script live at http://www.kunstplantenonline.nl/test/test.php and see the php-warnings. – Guido Lemmens 2 Jun 30 '14 at 22:17
  • Check this http://stackoverflow.com/questions/9149180/domdocumentloadhtml-error, and check my updated answer – hlscalon Jul 01 '14 at 11:57
  • I have changed "$dom = new DOMDocument;" to "$dom = new DOMDocument();" and it works :-) – Guido Lemmens 2 Jul 01 '14 at 13:22