22

I was trying to do it with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML: I used regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)

I googled around for a while looking for explanations but didn't find anything that helped (not with the class, anyway).

So I want to capture "Capture this text 1" and "Capture this text 2" and so on.

Doesn't look too hard, but I can't figure it out :(

<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
Mint

2 Answers

55

If you want to get :

  • The text
  • that's inside a <div> tag with class="text"
  • that's, itself, inside a <div> with class="main"

I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).

Instead, I would use an XPath query on your document, using the DOMXPath class.


For example, something like this should do, to load the HTML string into a DOM object, and instantiate the DOMXPath class :

$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
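
Real-world HTML is often not well-formed, and loadHTML() reports each problem as a PHP warning by default. The sketch below shows one way to keep those warnings out of the output, using libxml_use_internal_errors() and libxml_clear_errors(); the helper name loadHtmlQuietly is hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// load an HTML string while collecting libxml parse errors internally
// instead of emitting them as PHP warnings.
function loadHtmlQuietly(string $html): DOMDocument
{
    // Switch to internal error handling and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// A fragment with no html/body wrapper, as in the question.
$dom = loadHtmlQuietly('<div class="main"><div class="text">Capture this text 1</div></div>');
$xpath = new DOMXPath($dom);
```

If the parse errors matter to you, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.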


And, then, you can use XPath queries, with the DOMXPath::query method, which returns the list of elements you were searching for :

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}


And executing this gives me the following output :

string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
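
Note that @class="text" compares the whole attribute value, so a tag written as class="text intro" would not match. A minimal sketch of a common workaround follows, assuming PHP 7.4 or later (for the arrow function) and class names that contain no quote characters; the helper name captureText and the extra classes "wide" and "intro" are made up for illustration.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the trimmed text of every <div> carrying the class token $inner
// that is a direct child of a <div> carrying the class token $outer.
function captureText(string $html, string $outer, string $inner): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Builds an XPath test that matches one whole class token, so "main"
    // matches class="main wide" but not class="mainly".
    $hasClass = fn(string $name): string =>
        sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $name);

    $xpath = new DOMXPath($dom);
    $query = sprintf('//div[%s]/div[%s]', $hasClass($outer), $hasClass($inner));

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$html = <<<HTML
<div class="main wide">
    <div class="text">
    Capture this text 1
    </div>
</div>
<div class="main">
    <div class="text intro">
    Capture this text 2
    </div>
</div>
HTML;

print_r(captureText($html, 'main', 'text'));
```

With the exact-match query from the answer, neither of these two blocks would be captured, because one of the two class attributes in each block holds more than one class.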
Pascal MARTIN
  • Oh, no wonder Google didn't find anything, I was looking the wrong thing up. That's exactly what I needed. I was also wondering about a good way to get test HTML code into a string, but it looks like you read my mind and answered that too, thanks :) – Mint Apr 03 '10 at 12:36
  • You're welcome :-) Well, the more I use DOM, the more I love it ;-) Have fun ! – Pascal MARTIN Apr 03 '10 at 12:38
  • @PascalMARTIN correct me if I'm wrong, but doesn't `DOMDocument->loadHTML()` expect a real HTML document, html, head, body tags and all? – Christian Nov 14 '12 at 10:50
  • @Christian it can load not well-formed HTML *(and works with portions of HTML strings, with no html/body/... tags)* – Pascal MARTIN Nov 15 '12 at 05:25
  • @PascalMARTIN My bad! That's very useful to know. – Christian Nov 15 '12 at 08:36
1

You can use http://simplehtmldom.sourceforge.net/

It is a simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.

Something like this:

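// Find all <div> elements whose class attribute is "text" (the question uses class, not id).
// $html is assumed to be a simple_html_dom object, e.g. one returned by str_get_html().
$ret = $html->find('div[class=text]');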

See its documentation for more help.

donohoe
lokeshsk