4

I would like to extract all the URLs and link titles from a paragraph of text.

Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.

I can get all the href values with the following regex, but I don't know how to also get the title between the <a></a> tags.

preg_match_all('/<a.*href="?([^" ]*)" /iU', $v['message'], $urls);

Ideally, I would get an associative array like this:

[0] => Array
(
   [title] => XXX
   [link] => http://test.com/blop
)
[1] => Array
(
   [title] => XXX
   [link] => http://test.com
)

Thanks for your help

Simon Taisne

5 Answers

3

If you still insist on using a regex for this, you might be able to parse some inputs with this pattern:

<a.*?href="(.*?)".*?>(.*?)</a>

Note that it doesn't use the U modifier as yours did.
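
The effect of the U (PCRE_UNGREEDY) modifier can be seen in a small experiment. This is only a sketch: the sample string and variable names below are made up for illustration and do not come from the question.

```php
<?php
// Made-up sample with two sibling tags, to expose greedy vs. lazy matching.
$s = '<b>one</b><b>two</b>';

// Default: quantifiers are greedy, so .* runs to the LAST </b>.
preg_match('/<b>(.*)<\/b>/', $s, $greedy);

// With U, the default is inverted: .* becomes lazy and stops at the FIRST </b>.
preg_match('/<b>(.*)<\/b>/U', $s, $ungreedy);

// With U, a trailing ? flips the quantifier back to greedy.
preg_match('/<b>(.*?)<\/b>/U', $s, $flipped);

echo $greedy[1], "\n";   // one</b><b>two
echo $ungreedy[1], "\n"; // one
echo $flipped[1], "\n";  // one</b><b>two
```

So combining U with the lazy `.*?` quantifiers in the pattern above would make them greedy again, which is why the modifier is left out.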

Update: To have it accept single quotes as well as double quotes, you can use the following pattern instead:

<a.*?href=(?:"(.*?)"|'(.*?)').*?>(.*?)</a>
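
As a rough sketch of how this pattern could be turned into the array the question asks for: the variable names, the `~` delimiter, and the `is` modifiers are assumptions, and the second link of the sample has been switched to single quotes to exercise both branches.

```php
<?php
// Sample adapted from the question; the second href uses single quotes on purpose.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a>'
         . ' sur les remakes et suites souhaités sont'
         . ' <a href=\'http://test.com\' class="c_link-blue">dans le blog</a>.';

// Group 1: double-quoted href, group 2: single-quoted href, group 3: link text.
// Using ~ as the delimiter avoids having to escape the / in </a>.
$pattern = '~<a.*?href=(?:"(.*?)"|\'(.*?)\').*?>(.*?)</a>~is';

preg_match_all($pattern, $message, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $m) {
    $links[] = array(
        'title' => $m[3],
        // Only one of the two href groups is non-empty for any given match.
        'link'  => $m[1] !== '' ? $m[1] : $m[2],
    );
}

print_r($links);
```

Like any regex over HTML, this will miss or mangle less regular markup (unquoted attributes, nested tags inside the link, and so on).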
Marcus
3

As has been mentioned in the comments, don't use a regular expression; use a DOM parser instead.
For example:

<?php
$doc = new DOMDocument;
$doc->loadHTML( getExampleData() );

$xpath = new DOMXPath($doc);
foreach( $xpath->query('/html/body/p[@id="abc"]//a') as $node ) {
    echo $node->getAttribute('href'), ' - ' , $node->textContent, "\n";
}

function getExampleData() {
    return '<html><head><title>...</title></head><body>
    <p>
        not <a href="wrong">this one</a> but ....
    </p>
    <p id="abc">
        Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.
    </p>
    </body></html>';
}

See http://docs.php.net/DOMDocument and http://docs.php.net/DOMXPath

VolkerK
2

You shouldn't use RegEx for this. You should use an XML/DOM parser. I made this quickly using DOMDocument.

$links = array();
$dom = new DOMDocument;
@$dom->loadHTML('Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.');
$xPath = new DOMXPath($dom);
$a = $xPath->query('//a');
for($i=0; $i<$a->length; $i++){
    $e = $a->item($i);
    $links[] = array(
        'title' => $e->nodeValue,
        'link' => $e->getAttribute('href')
    );
}
print_r($links);

DEMO: http://codepad.org/2LEn2CAJ
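
The same idea could be wrapped in a small reusable function. This is only a sketch under a few assumptions: `extractLinks` is a hypothetical helper name, the input is assumed to be UTF-8, and the XML declaration prefix is a commonly used workaround (not something from this answer) to hint the encoding to `loadHTML()`, which may otherwise misread accented text such as "résultats".

```php
<?php
// Hypothetical helper: returns a title/link pair for every <a> element in $html.
function extractLinks($html)
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    // Assumption: $html is UTF-8; the prefix hints that encoding to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Discard the collected warnings and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = array(
            'title' => $a->textContent,
            'link'  => $a->getAttribute('href'),
        );
    }
    return $links;
}

$links = extractLinks('Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.');
print_r($links);
```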

gen_Eric
1
preg_match_all("/<a[^>]*href=\"([^\"]*)[^>]*>([^<]*)<\/a>/", $v['message'], $urls, PREG_SET_ORDER);

should give you what you want. It's not an associative array, but with PREG_SET_ORDER each match is a nested array of the form [full match, href, link text], which is close to the format you want.
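
If keyed entries are wanted, one option is PCRE named groups. The following is a sketch, not a drop-in replacement: the group names `link` and `title` and the variable names are assumptions chosen to match the array shape in the question. Note that when `/` is the delimiter, the `/` in `</a>` has to be escaped.

```php
<?php
// Sample input taken from the question.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// Named groups make each match addressable by key as well as by index.
$pattern = '/<a[^>]*href="(?<link>[^"]*)"[^>]*>(?<title>[^<]*)<\/a>/i';

preg_match_all($pattern, $message, $matches, PREG_SET_ORDER);

// Each match also carries numeric keys, so keep only the two named ones.
$links = array_map(function ($m) {
    return array('title' => $m['title'], 'link' => $m['link']);
}, $matches);

print_r($links);
```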

GlyphGryph
0

To the people suggesting DOM: it might be nice to use DOM, but surely you would not use a FULL DOM parser just to parse a couple of URLs/titles!

Just use this regex:

/<a.*href="([^" ]*)".*>(.*)<\/a>/iU
Yousf
    Of course I will use a *full* DOM parser to parse a couple of urls/titles. That's what a DOM parser is for, parsing DOM. – gen_Eric Oct 25 '11 at 13:57