In my code, I am trying to fetch the entire HTML of each page from my old website while ignoring all JavaScript (AdSense code). I have about 800 pages, and it is hard to copy them one by one. The main problem is that my XPath is too long and gives me an error every time. Secondly, it only prints the text instead of the HTML code. I don't know how to resolve this.

My XPath

/html/body/div/div/div/div[4]/table/tbody/tr/td/div/h2/table/tbody/tr/td/div[1]/table/tbody/tr/td[1]/div/table/tbody/tr/td/div/table/tbody/tr/td/div/table/tbody/tr/td/div

Errors I am getting are available at https://pastebin.com/FFRLr3vq

My Current PHP Code

error_reporting(E_ERROR);
$urls[] = "http://myoldwebsite.com/somepage.html";

function curlload($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL,$url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);
        $source = curl_exec($ch);
        return $source;
}

foreach ($urls as $url) {
$source = curlLoad($url);
@$doc = new DOMDocument();
@$doc->loadHTML($source);   

$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//div[@class='pageContent']");

// To check the result:
echo "<p>" . $node->nodeValue . "</p>";
}
Rtra
  • Does the table have any attributes you can attach onto? Can you please post the table source? That would help me help you better. – IamBatman Sep 11 '17 at 15:25
  • @IamBatman can you please review my update php code – Rtra Sep 11 '17 at 15:27
  • @Rtra offtopic: You should rename your function to `curlLoad` or call it as `curlload`, but don't mix the case. Also, you should not use `@` to suppress errors. That is bad practice. – Xatenev Sep 11 '17 at 15:29
  • @Rtra ontopic: The errors simply tell you that you are trying to load invalid HTML-markup, meaning the error is not in this code but in the `$source` file. – Xatenev Sep 11 '17 at 15:30
  • @Xatenev thanks for your suggestion I renamed the function – Rtra Sep 11 '17 at 15:31
  • @Xatenev I am getting this error `Fatal error: Call to a member function saveHTML() on null in` when I use `$dom->saveHTML($nodeList->item(0));` – Rtra Sep 11 '17 at 15:32
  • @Rtra There is no call to `saveHTML()` in your code. Also, that error is not listed in the error paste you linked. You should take one step back, think about what you **actually** want to ask, and ask another **proper** question. (But you probably pasted that code, and in your code it's `$doc->saveHTML` instead...) – Xatenev Sep 11 '17 at 15:33
  • @Xatenev As I said I want to print HTML instead of PlainText – Rtra Sep 11 '17 at 15:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/154152/discussion-between-rtra-and-xatenev). – Rtra Sep 11 '17 at 15:34
  • @Xatenev can we discuss in chat – Rtra Sep 11 '17 at 15:34

1 Answer

To output the loaded HTML, you can use `DOMDocument::saveHTML()`:

http://php.net/manual/de/domdocument.savehtml.php
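
As a minimal sketch of the difference between `nodeValue` and `saveHTML()`, the snippet below runs the XPath query from the question against a small sample string. The sample markup is a hypothetical stand-in for one of the old pages, and the `pageContent` class is taken from the question. Passing the node to `saveHTML()` should serialize it together with its tags, whereas `nodeValue` only gives the text.

```php
<?php
// Hypothetical sample markup standing in for one fetched page.
$html = '<html><body><div class="pageContent"><p>Hello <b>world</b></p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//div[@class='pageContent']");

// item(0) returns null when the query matched nothing, so check first.
$node = $nodeList->item(0);

if ($node !== null) {
    // nodeValue is only the concatenated text content of the node.
    $text = $node->nodeValue;

    // saveHTML($node) serializes the node including its markup.
    $markup = $doc->saveHTML($node);

    echo $text, PHP_EOL;
    echo $markup, PHP_EOL;
}
```

Note that `saveHTML()` is called on the `DOMDocument` variable (`$doc` here), so a "Call to a member function saveHTML() on null" error usually points at a misspelled or undefined document variable.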

To remove script tags (as discussed in the chat), you can use something like this:

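<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

// getElementsByTagName() returns a live list that changes as nodes are removed
$script = $dom->getElementsByTagName('script');

// Copy the nodes into a plain array first, so removal does not skip any
$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

// Detach every collected <script> element from its parent
foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

// Serialize the cleaned document back to an HTML string
$html = $dom->saveHTML();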

Source & more info: remove script tag from HTML content
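
Putting the two steps together for the loop over the 800 pages might look roughly like the sketch below. The function name `extractContentHtml` and the sample string are hypothetical, the `pageContent` query comes from the question, and the nullable return type assumes PHP 7.1 or later. It uses `libxml_use_internal_errors()` to collect the invalid-markup warnings rather than suppressing them with `@`.

```php
<?php
/**
 * Hypothetical helper: returns the HTML of the first node matching $query,
 * with all <script> elements removed, or null if nothing matches.
 */
function extractContentHtml(string $source, string $query): ?string
{
    // loadHTML() rejects an empty string, e.g. after a failed download.
    if ($source === '') {
        return null;
    }

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array before removing anything from it.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    $xpath = new DOMXPath($doc);
    $nodeList = $xpath->query($query);
    if ($nodeList === false || $nodeList->length === 0) {
        return null;
    }

    // Serialize only the matched node, tags included.
    return $doc->saveHTML($nodeList->item(0));
}

// Hypothetical sample page with an ad script inside the content div.
$sample = '<html><body><div class="pageContent"><p>Text</p>'
        . '<script>adsbygoogle.push({});</script></div></body></html>';

$result = extractContentHtml($sample, "//div[@class='pageContent']");
echo $result, PHP_EOL;
```

In the question's loop, `$sample` would be replaced by the string returned from `curlLoad($url)`, and a `null` result would indicate a page where the query found nothing.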

Xatenev