To get content in the body tag, I'm using the code below.

$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('body');
$body = $nodes->item(0)->nodeValue;

How can I remove the JavaScript code from $body? That is, anything that looks like:

<script> /*Some js code*/ </script>

lomse

3 Answers

Try this:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

A regex call can fail (for example, when a backtracking limit is hit), so it is safer to write it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ?? $html;

preg_replace() returns null on error, so if that happens we keep the original $html instead of ending up with an empty value.
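As a rough illustration of the pattern above, here is a minimal sketch run against a made-up input string (the sample HTML and the $clean variable are assumptions, not part of the question). It adds the "i" flag so that uppercase <SCRIPT> tags are matched too, and it only works on the raw HTML string, not on text already extracted with nodeValue:

```php
<?php
// Hypothetical sample input, only for illustration.
$html = '<body><p>Hello</p><script type="text/javascript">alert("x");</script>'
      . '<SCRIPT>var a = 1;</SCRIPT><p>World</p></body>';

// "s" lets "." match newlines, "i" makes the match case-insensitive.
// preg_replace() returns null on failure, so ?? falls back to the input.
$clean = preg_replace('/<script\b.*?<\/script\s*>/is', '', $html) ?? $html;

echo $clean, PHP_EOL; // <body><p>Hello</p><p>World</p></body>
```

Both the tags and the code between them are dropped here, but a regex like this can still be fooled by unusual markup (for instance "</script>" inside a JavaScript string), so treat it as a quick fix rather than a parser.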

Manikiran
  • This removes only the script tags but leaves the JavaScript content. The idea is to remove both the script tags and the JavaScript content. – lomse Jan 03 '16 at 19:43

If you are already using DOMDocument, why not remove the nodes with it?

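$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile("from_link_to.html");
// getElementsByTagName() returns a live list, so copy it to an array before removing nodes
$scripts = iterator_to_array($dom->getElementsByTagName('script'));
foreach ($scripts as $script) {
    // removeChild() has to be called on the node's parent, not on the node list
    $script->parentNode->removeChild($script);
}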
...

Take a closer look at the DOMDocument class. By the way, regular expressions are a nightmare for tasks like this.
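One detail worth spelling out: getElementsByTagName() returns a live list, so removing nodes while looping over it with foreach can skip elements. A minimal sketch of one way around that is to walk the list backwards. The helper name removeScripts() and the sample markup are made up for this example:

```php
<?php
// Hypothetical helper: removes every <script> element, returns how many were removed.
function removeScripts(DOMDocument $dom): int
{
    $scripts = $dom->getElementsByTagName('script');
    $removed = 0;

    // Iterate from the end so removals do not shift the items still to be visited.
    for ($i = $scripts->length - 1; $i >= 0; $i--) {
        $node = $scripts->item($i);
        $node->parentNode->removeChild($node);
        $removed++;
    }

    return $removed;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of using @
$dom->loadHTML('<html><body><p>A</p><script>one();</script><script>two();</script><p>B</p></body></html>');
libxml_clear_errors();

$count = removeScripts($dom);
echo $count, PHP_EOL;                                       // 2
echo $dom->getElementsByTagName('script')->length, PHP_EOL; // 0
```

Copying the list into an array first and then removing each item achieves the same thing; the backwards loop just avoids the extra array.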

mkungla

The solution here has fixed my issue. The code below completely removes script tags and their content from the body tag:

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
@$doc->loadHTML($html);
$script = $doc->getElementsByTagName('script');

$remove = [];
foreach ($script as $item) {
    $remove[] = $item;
}

foreach ($remove as $item) {
    $item->parentNode->removeChild($item);
}

$node = $doc->getElementsByTagName('body');
$body = $node->item(0)->nodeValue;

echo $body;
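Note that nodeValue gives the text of the body with all markup stripped. If the remaining HTML of the body is wanted instead, one possible approach (a sketch under assumptions; the sample markup and the $text and $markup variables are made up here) is to serialize the body's child nodes with saveHTML():

```php
<?php
// Hypothetical sample input, only for illustration.
$html = '<html><body><h1>Title</h1><script>track();</script><p>Text</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of using @
$doc->loadHTML($html);
libxml_clear_errors();

// Copy the live node list to an array before removing anything.
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script);
}

$bodyNode = $doc->getElementsByTagName('body')->item(0);

// Plain text only, equivalent to nodeValue for an element.
$text = $bodyNode->textContent;

// Inner HTML of the body: serialize each child node in turn.
$markup = '';
foreach ($bodyNode->childNodes as $child) {
    $markup .= $doc->saveHTML($child);
}

echo $text, PHP_EOL;
echo $markup, PHP_EOL;
```

Here $text holds only the visible text, while $markup keeps the h1 and p elements without the script.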
lomse