
I am trying to get a specific div element (i.e. the one with the attribute id="vung_doc") from a website, but I get almost every element instead. Do you have any idea what's wrong?

$doc = new DOMDocument;

// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;

// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;

$doc->loadHTMLFile('http://lightnovelgate.com/chapter/epoch_of_twilight/chapter_300');

$xpath = new DOMXPath($doc);

$query = "//*[@class='vung_doc']";


$entries = $xpath->query($query);
var_dump($entries->item(0)->textContent);
Sᴀᴍ Onᴇᴌᴀ

2 Answers


Change

$query = "//*[@class='vung_doc']";

to

$query = "//*[@id='vung_doc']";
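
To see how the two queries can differ, here is a minimal sketch run against a small inline document. The markup is hypothetical and only stands in for the real page: it assumes a second element happens to share the class name, which may or may not be true of the actual site.

```php
<?php
// Hypothetical markup: two elements share the class, only one has the id.
$html = '<html><body>'
      . '<div class="vung_doc">sidebar</div>'
      . '<div id="vung_doc" class="vung_doc">chapter text</div>'
      . '</body></html>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Matches every element whose class attribute is exactly "vung_doc"
$byClass = $xpath->query("//*[@class='vung_doc']");

// Matches only the element whose id attribute is "vung_doc"
$byId = $xpath->query("//*[@id='vung_doc']");

echo 'class matches: ', $byClass->length, "\n";
echo 'id matches: ', $byId->length, "\n";
echo 'id text: ', $byId->item(0)->textContent, "\n";
```

Since an id is supposed to be unique within a page, the id query is the more selective of the two whenever the class is reused.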
Halfstop

Actually, it appears that the one matching element, which has both its id and class attributes set to vung_doc, contains many paragraphs in its text content. Perhaps you expected each paragraph to be in its own div element.

<div id="vung_doc" class="vung_doc" style="font-size: 18px;">
    <p></p>
    "Mayor song..."

In the screenshot at the bottom of this post, I added an outline style to that element, to show just how many paragraphs are within that element.

If you want to separate the paragraphs, you could use preg_split() to split the text on runs of newline characters:

$entries = $xpath->query($query);

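foreach ($entries as $entry) {
    // Split the element's text on runs of CR/LF characters
    $paragraphs = preg_split('/[\r\n]+/', $entry->textContent);
    foreach ($paragraphs as $paragraph) {
        // Skip pieces that contain only whitespace
        if (trim($paragraph) !== '') {
            echo '<b>paragraph:</b> ' . $paragraph;
            // Stop after the first non-empty paragraph; remove this to print them all
            break;
        }
    }
}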
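
As a self-contained variation on the same idea, the sketch below collects every non-empty paragraph into an array instead of echoing the first one. The inline markup is a made-up stand-in for the real page, and the pattern additionally swallows whitespace around the line breaks, which is my assumption about what is wanted rather than something the page requires.

```php
<?php
// Hypothetical stand-in for the target element: text separated by line breaks.
$html = "<div id=\"vung_doc\"><p></p>First paragraph.\n\n"
      . "Second paragraph.\r\nThird paragraph.</div>";

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$node = $xpath->query("//*[@id='vung_doc']")->item(0);

// Split on runs of line breaks (plus surrounding whitespace) and
// drop empty pieces with PREG_SPLIT_NO_EMPTY.
$paragraphs = preg_split(
    '/\s*[\r\n]+\s*/',
    trim($node->textContent),
    -1,
    PREG_SPLIT_NO_EMPTY
);

foreach ($paragraphs as $i => $paragraph) {
    echo $i + 1, ': ', $paragraph, "\n";
}
```

Keeping the result in an array makes it straightforward to wrap each paragraph in its own element later, if that is the end goal.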

See a demonstration of this in this playground example. Note that libxml_use_internal_errors() is called before loading the HTML file, so that libxml's parse errors for the invalid markup are handled internally instead of being emitted as PHP warnings:

libxml_use_internal_errors(true);
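
A slightly fuller sketch of that pattern follows. It uses a deliberately sloppy, hypothetical snippet in place of the real page. Whether any errors are actually recorded for it depends on the libxml2 version, so the example only prints whatever was collected.

```php
<?php
// Collect libxml errors internally; remember the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
// Hypothetical invalid markup: unknown tag that is never closed.
$loaded = $doc->loadHTML('<div id="vung_doc"><foo>text</div>');

// Fetch the collected errors (an array of LibXMLError objects), then reset.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable despite the invalid markup.
$xpath = new DOMXPath($doc);
$node = $xpath->query("//*[@id='vung_doc']")->item(0);
echo $node->textContent, "\n";
```

Restoring the previous setting afterwards keeps the change local, so other XML handling in the same script still reports errors the usual way.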

Screenshot of the target div element with outline added:

screenshot

Sᴀᴍ Onᴇᴌᴀ