<li class="zk_list_c2 f_l"><a title="abc" target="_blank" href="link">
                                        abc
                                    </a>&nbsp;</li>

How would I extract abc and link?

$pattern="/<li class=\"zk_list_c2 f_l\"><a title=\"(.*)\" target=\"_blank\" href=\"(.*)\">\s*(.*)\s*<\/a>&nbsp;<\/li>/m";
preg_match_all($pattern, $content, $matches);

The one I have right now doesn't seem to work.

Brad Mace
hao

1 Answer


Since you are trying to extract data from an HTML string, regex is generally not the right tool for the job.

Instead, why not use a DOM parser, such as the DOMDocument class that ships with PHP, and its DOMDocument::loadHTML method?

Then you can navigate through your HTML document using DOM methods, which is much easier than using regex, especially since HTML is not a regular language.


Here, for example, you could use something like this:

$html = <<<HTML
<li class="zk_list_c2 f_l"><a title="abc" target="_blank" href="link">
        abc
    </a>&nbsp;</li>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$as = $dom->getElementsByTagName('a');
foreach ($as as $a) {
    var_dump($a->getAttribute('href'));
    var_dump(trim($a->nodeValue));
}

And you would get the following output:

string(4) "link"
string(3) "abc"


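The code is not very hard, I'd say, but, in a few words, here is what it's doing:

  • Loading the HTML string into a DOMDocument with loadHTML
  • Getting all the <a> tags with getElementsByTagName
  • For each tag, reading the href attribute and the trimmed text content of the node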

Just a note: you might want to check that the href attribute exists, with DOMElement::hasAttribute, before trying to use its value.
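
The hasAttribute check mentioned above could look roughly like the sketch below. The sample markup (in particular the second link, which has no href) and the $links array are assumptions made for illustration; they are not part of the original question.

```php
<?php
// Hypothetical sample: the second <a> deliberately has no href attribute.
$html = <<<HTML
<ul>
<li class="zk_list_c2 f_l"><a title="abc" target="_blank" href="link">
        abc
    </a>&nbsp;</li>
<li class="zk_list_c2 f_l"><a title="def">def</a></li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // Skip anchors without an href instead of collecting an empty string.
    if (!$a->hasAttribute('href')) {
        continue;
    }
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->nodeValue),
    ];
}

print_r($links);
```

Without the check, getAttribute would return an empty string for the second link, which is easy to mistake for a real (empty) href.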


EDIT after the comments: here's a quick example using DOMXPath to get to the links. I assumed you want the link that's inside the <li> tag with class="zk_list_c2 f_l":

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$as = $xpath->query('//li[@class="zk_list_c2 f_l"]/a');

foreach ($as as $a) {
    var_dump($a->getAttribute('href'));
    var_dump(trim($a->nodeValue));
}

And, again, you get:

string(4) "link"
string(3) "abc"


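As you can see, the only thing that changes is how you get to the right <a> tag: instead of DOMDocument::getElementsByTagName, it's just a matter of:

  • Instantiating a DOMXPath object on the document
  • Calling its query method with an XPath expression that selects the <a> tags you want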

Pascal MARTIN
  • There is more than one <li> I am trying to extract, so I can't really use the DOMDocument class, and the HTML is more complex than that. – hao Mar 27 '10 at 16:45
  • @hao: then maybe an XPath query, instead of getElementsByTagName, could do the trick? (see http://www.php.net/manual/en/domxpath.query.php) Anyway, I would really not use regex for this kind of data extraction. – Pascal MARTIN Mar 27 '10 at 16:47
  • You can set up a loop that goes over the array that DOMDocument::getElementsByTagName('li') returns, and extract the data using the above method IF (class == "zk_list_c2 f_l") – Powertieke Mar 27 '10 at 17:05
  • How would I get only the <a> tags inside of the <li>? $lis = $dom->getElementsByTagName('li'); foreach ($lis as $li) { if ($li->getAttribute('class') == "zk_list_c2 f_l") { echo trim($li->nodeValue) . "\n"; } } This doesn't display the <a> tags inside. Is there a way for me to extract everything inside of the <li>? – hao Mar 27 '10 at 17:14
  • The `nodeValue` property only contains the content of the node (here, your `<li>` tag) itself, and not the content of its children. To get the content of the `<a>` tag, you have to access it. I suppose you could do that with the `childNodes` property of your `$li`, to loop over the children of `<li>`; but, if I had to choose, I would rather use XPath. I've edited my answer with an example. – Pascal MARTIN Mar 27 '10 at 17:20
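
For completeness, the loop-over-<li> approach discussed in the comments could be sketched as follows. This is only one possible way to do it, not the answer's method: the sample markup with several <li> elements and the $results array are assumptions, and the sketch relies on DOMDocument::saveHTML accepting a node argument (PHP 5.3.6 or later) to rebuild the markup inside each <li>.

```php
<?php
// Hypothetical sample: two matching <li> elements and one that should be ignored.
$html = <<<HTML
<ul>
<li class="zk_list_c2 f_l"><a title="abc" target="_blank" href="link">
        abc
    </a>&nbsp;</li>
<li class="other"><a href="skip">skip</a></li>
<li class="zk_list_c2 f_l"><a title="def" target="_blank" href="link2">def</a></li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$results = [];
foreach ($dom->getElementsByTagName('li') as $li) {
    // Keep only the <li> elements with the exact class string from the question.
    if ($li->getAttribute('class') !== 'zk_list_c2 f_l') {
        continue;
    }

    // Serialize every child node to rebuild the markup inside this <li>.
    $inner = '';
    foreach ($li->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }

    // Collect the links nested under this <li>.
    foreach ($li->getElementsByTagName('a') as $a) {
        $results[] = [
            'href'  => $a->getAttribute('href'),
            'text'  => trim($a->nodeValue),
            'inner' => trim($inner),
        ];
    }
}

print_r($results);
```

The exact-match comparison on the class attribute mirrors the check suggested in the comments; it would miss an <li> whose classes appear in a different order or with extra classes.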