How can I extract the text content from a list of unknown links? Suppose I have this:

<div class="unknown_class">
    <a title="The title x" href="link1.html">This is the content I need 1</a><br>
    <a title="The title y" href="another-link.html">This is the content I need 2</a><br>
    <a title="The title z" href="something-else.html">This is the content I need 3</a><br>
</div>

<a title="The title 0" href="something.html">I dont need this</a>

I think a regex could work here, but I have no idea how to apply it. :(

This is the result I need:

Array(
    'This is the content I need 1',
    'This is the content I need 2',
    'This is the content I need 3'
)

Any help is appreciated.

Andrei Surdu
  • The best way is to use a DOM parser. See the linked duplicate. Something like this would work: `$links = $dom->getElementsByTagName('a'); foreach ($links as $link) { $arr[] = $link->nodeValue; }`. – Amal Murali Jun 29 '14 at 13:54
  • @AmalMurali, No, the HTML is from an invalid HTML document, so the DOM parser does not work. I've also updated my question. There are links that I don't need. The links that I need are followed by a `<br>` tag. – Andrei Surdu Jun 29 '14 at 13:56
  • @Smartik The HTML you've posted looks fairly decent; perhaps you should post a link to the full code? – Ja͢ck Jun 29 '14 at 13:58
  • @Jack The full HTML is a complete page. It includes doctype, head, body tags and other inline CSS/JS. DOM parser does not validate it. Andy Truong's answer is what I need. Thank you for attention. – Andrei Surdu Jun 29 '14 at 14:10
  • @Smartik It doesn't have to validate it; libxml will try to make sense of the document ... but if you don't wish to post it, that's up to you of course. – Ja͢ck Jun 29 '14 at 14:12
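
The DOM approach discussed in these comments can be sketched as follows. This is only a minimal illustration that uses the snippet from the question, since the full page was never posted; whether libxml recovers equally well on the real document is an assumption. The XPath expression and the `$texts` variable are illustrative choices, not something taken from the thread.

```php
<?php
// Sample markup from the question, including the link that should be skipped.
$html = '<div class="unknown_class">
    <a title="The title x" href="link1.html">This is the content I need 1</a><br>
    <a title="The title y" href="another-link.html">This is the content I need 2</a><br>
    <a title="The title z" href="something-else.html">This is the content I need 3</a><br>
</div>

<a title="The title 0" href="something.html">I dont need this</a>';

// Collect libxml parse warnings internally instead of emitting them,
// so imperfect markup does not flood the output.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select only <a> elements whose next element sibling is a <br>.
$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query('//a[following-sibling::*[1][self::br]]') as $link) {
    $texts[] = trim($link->nodeValue);
}

print_r($texts);
```

The `following-sibling::*[1][self::br]` predicate encodes the rule stated above, that the wanted links are followed by a `<br>` tag, so the trailing link outside the div is not selected.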

1 Answer

You can use preg_match_all() and read the link text from the first capture group:

$html = '<div class="unknown_class">
    <a title="The title" href="link1.html">This is the content I need 1</a>
    <a title="The title" href="another-link.html">This is the content I need 2</a>
    <a title="The title" href="something-else.html">This is the content I need 3</a>
</div>';

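// [^>]+ consumes the attributes of the opening tag; ([^<]+) captures the link text.
// Note: this matches every <a> element in $html, not only those inside the div.
preg_match_all('`<a[^>]+>([^<]+)</a>`', $html, $matches);
print_r($matches[1]); // $matches[1] holds the captured link texts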
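
If the input also contains links that should be skipped, as in the question, the pattern can be tightened. The following is a sketch under the assumption stated in the question's comments, namely that every wanted link is immediately followed by a `<br>` tag; the lookahead is an addition for illustration and has not been tested against the real page.

```php
<?php
// Sample markup from the question, including the link that should be skipped.
$html = '<div class="unknown_class">
    <a title="The title x" href="link1.html">This is the content I need 1</a><br>
    <a title="The title y" href="another-link.html">This is the content I need 2</a><br>
    <a title="The title z" href="something-else.html">This is the content I need 3</a><br>
</div>

<a title="The title 0" href="something.html">I dont need this</a>';

// Capture the anchor text only when the closing </a> is followed by a <br> tag.
// The lookahead tolerates whitespace before the tag and an optional "/" (<br />).
preg_match_all('`<a\b[^>]*>([^<]+)</a>(?=\s*<br\s*/?>)`i', $html, $matches);

print_r($matches[1]);
```

Because the `<br>` check is a lookahead, it is not consumed by the match, and `$matches[1]` still contains only the link texts.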
Hong Truong