Parse HTML with PHP's HTML DOMDocument

Question

I was trying to do it with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML; I used regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)

I googled around for a while looking for explanations but didn't find anything that helped (not with the class, anyway).

So I want to capture "Capture this text 1" and "Capture this text 2" and so on.

Doesn't look too hard, but I can't figure it out :(

<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>

score 55 · Accepted Answer · answered Apr 03 '10 at 12:28

If you want to get:

- the text
- that's inside a <div> tag with class="text"
- that is itself inside a <div> with class="main"

I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).
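
To illustrate that point, here is a minimal sketch (reusing the markup from the question) of what getElementsByTagName hands back and the manual filtering it would force on you. The variable names `$divs` and `$texts` are only illustrative.

```php
<?php
$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

// Every <div> is returned: the two outer "main" ones and the two inner "text" ones.
$divs = $dom->getElementsByTagName('div');
echo $divs->length, "\n"; // 4

// Filtering by class and by parent has to be done by hand.
$texts = [];
foreach ($divs as $div) {
    $parent = $div->parentNode;
    if ($div->getAttribute('class') === 'text'
        && $parent instanceof DOMElement
        && $parent->getAttribute('class') === 'main') {
        $texts[] = trim($div->nodeValue);
    }
}
print_r($texts);
```

It works, but the class and nesting checks end up spread across PHP code rather than stated in one place.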

Instead, I would use an XPath query on your document, using the DOMXPath class.

For example, something like this should do to load the HTML string into a DOM object and instantiate the DOMXPath class:

$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

Then you can run XPath queries with the DOMXPath::query method, which returns the list of elements you were searching for:

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}
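
One caveat with the query above: `@class="text"` compares the whole attribute value, so it would miss an element written as `class="text intro"`. If your real markup carries several classes per element, a common XPath 1.0 idiom is to pad the attribute with spaces and look for the padded token. The sketch below assumes such markup; the extra classes (`wide`, `intro`, `textual`) and the `$hasClass` helper are made up for the illustration.

```php
<?php
// Hypothetical markup with more than one class per element.
$html = <<<HTML
<div class="main wide">
    <div class="text intro">Capture this text 1</div>
    <div class="textual">Not this one</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Illustrative helper: builds an XPath predicate that matches one class token
// inside a space-separated class attribute.
$hasClass = function (string $name): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
};

$query = '//div[' . $hasClass('main') . ']/div[' . $hasClass('text') . ']';

$found = [];
foreach ($xpath->query($query) as $node) {
    $found[] = trim($node->nodeValue);
}
print_r($found);
```

Padding both sides with spaces is what keeps `textual` from matching as if it were `text`.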

Executing this gives me the following output:

string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)

Pascal MARTIN

Oh, no wonder Google didn't find anything, I was looking the wrong thing up. That's exactly what I needed. I was also wondering about a good way to get test HTML code into a string, but it looks like you read my mind and answered that too, thanks :) – Mint Apr 03 '10 at 12:36
You're welcome :-) Well, the more I use DOM, the more I love it ;-) Have fun ! – Pascal MARTIN Apr 03 '10 at 12:38
@PascalMARTIN correct me if I'm wrong, but doesn't `DOMDocument->loadHTML()` expect a real HTML document, html, head, body tags and all? – Christian Nov 14 '12 at 10:50
@Christian it can load HTML that is not well-formed *(and works with portions of HTML strings, with no html/body/... tags)* – Pascal MARTIN Nov 15 '12 at 05:25
@PascalMARTIN My bad! That's very useful to know. – Christian Nov 15 '12 at 08:36
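
For anyone wanting to check the point from these comments, here is a small sketch, assuming a bare fragment with an unclosed <div>. The `libxml_use_internal_errors()` calls are an addition of this sketch, not part of the answer; they just keep any parser warnings from being printed. The names `$fragment`, `$bodyCount` and `$text` are illustrative.

```php
<?php
// A fragment with no <html>/<body> wrapper and a missing closing </div>.
$fragment = '<div class="text">Capture this text 1';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($fragment);

// Discard whatever was collected and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser wraps the fragment in <html><body> on its own.
$bodyCount = $dom->getElementsByTagName('body')->length;
$text = trim($dom->getElementsByTagName('div')->item(0)->nodeValue);

echo $bodyCount, "\n";
echo $text, "\n";
echo $dom->saveHTML();
```

The saveHTML() output shows the wrapper elements that loadHTML added around the fragment.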

score 1 · Answer 2 · edited Jan 22 '16 at 01:18

You can use http://simplehtmldom.sourceforge.net/

It is a very simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.

Something like this:

// Find all <div> which have attribute id=text
$ret = $html->find('div[id=text]');

See its documentation for more help.

edited Jan 22 '16 at 01:18

donohoe

answered Mar 12 '14 at 08:16

lokeshsk
