Extract content from links in a PHP array

Question

How can I extract the content from a list of unknown links? Suppose I have this:

<div class="unknown_class">
    <a title="The title x" href="link1.html">This is the content I need 1</a><br>
    <a title="The title y" href="another-link.html">This is the content I need 2</a><br>
    <a title="The title z" href="something-else.html">This is the content I need 3</a><br>
</div>

<a title="The title 0" href="something.html">I dont need this</a>

I think a regex could work here, but I have no idea how to apply it. :(

This is the result I need:

Array(
    'This is the content I need 1',
    'This is the content I need 2',
    'This is the content I need 3'
)

Any help is appreciated.

The best way is to use a DOM parser. See the linked duplicate. Something like this would work: `$links = $dom->getElementsByTagName('a'); foreach ($links as $link) { $arr[] = $link->nodeValue; }`. — Amal Murali, Jun 29 '14 at 13:54
@AmalMurali, No, the HTML is from an invalid HTML document, so a DOM parser does not work. I've also updated my question. There are links that I don't need. The links that I need are followed by the `<br>` tag. — Andrei Surdu, Jun 29 '14 at 13:56
@Smartik The HTML you've posted looks fairly decent; perhaps you should post a link to the full code? — Ja͢ck, Jun 29 '14 at 13:58
@Jack The full HTML is a complete page. It includes doctype, head, body tags and other inline CSS/JS. The DOM parser does not validate it. Andy Truong's answer is what I need. Thank you for your attention. — Andrei Surdu, Jun 29 '14 at 14:10
@Smartik It doesn't have to validate it; libxml will try to make sense of the document ... but if you don't wish to post it, that's up to you of course. — Ja͢ck, Jun 29 '14 at 14:12
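
The DOM approach discussed in these comments could look roughly like the following. This is only a sketch: it runs against the fragment posted in the question (the full, invalid page is not available here), it assumes the wanted links are exactly those whose next element is a `<br>`, and the `$contents` variable name is made up for illustration.

```php
<?php
// Fragment from the question; the real page is assumed to be larger and invalid.
$html = <<<'HTML'
<div class="unknown_class">
    <a title="The title x" href="link1.html">This is the content I need 1</a><br>
    <a title="The title y" href="another-link.html">This is the content I need 2</a><br>
    <a title="The title z" href="something-else.html">This is the content I need 3</a><br>
</div>

<a title="The title 0" href="something.html">I dont need this</a>
HTML;

$dom = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting them,
// so malformed markup does not flood the output.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Keep only <a> elements whose next element sibling is a <br>.
$nodes = $xpath->query('//a[following-sibling::*[1][self::br]]');

$contents = [];
foreach ($nodes as $node) {
    $contents[] = trim($node->textContent);
}

print_r($contents);
```

The XPath predicate is what excludes the last link, since nothing follows it. Whether libxml copes with the real page can only be checked against that page.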

score 1 · Accepted Answer · answered Jun 29 '14 at 13:56

You can use preg_match_all():

$html = '<div class="unknown_class">
    <a title="The title" href="link1.html">This is the content I need 1</a>
    <a title="The title" href="another-link.html">This is the content I need 2</a>
    <a title="The title" href="something-else.html">This is the content I need 3</a>
</div>';

preg_match_all('`<a[^>]+>([^<]+)</a>`', $html, $matches);
print_r($matches[1]);
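
If only the links followed by a `<br>` tag are wanted, as mentioned in the comments on the question, the pattern could be tightened along these lines. This is a sketch that assumes the markup looks like the sample in the question; `$contents` is a name introduced here for illustration.

```php
<?php
// Sample markup from the question, including the link that should be skipped.
$html = <<<'HTML'
<div class="unknown_class">
    <a title="The title x" href="link1.html">This is the content I need 1</a><br>
    <a title="The title y" href="another-link.html">This is the content I need 2</a><br>
    <a title="The title z" href="something-else.html">This is the content I need 3</a><br>
</div>

<a title="The title 0" href="something.html">I dont need this</a>
HTML;

// Capture the link text only when </a> is followed by optional
// whitespace and a <br> (or <br/>) tag; the i flag ignores case.
preg_match_all('`<a\b[^>]*>([^<]+)</a>\s*<br\s*/?>`i', $html, $matches);

$contents = $matches[1];
print_r($contents);
```

Like the original pattern, this only captures links whose content is plain text; a link containing nested tags would not match.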

Hong Truong

Thank you. Works exactly how I want. Accepted. – Andrei Surdu Jun 29 '14 at 14:10
