Regex to find all URLs and titles

Question

I would like to extract all the urls and titles from a paragraph of text.

Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.

I am able to get all the hrefs with the following regex, but I don't know how to also get the title between the <a></a> tags.

preg_match_all('/<a.*href="?([^" ]*)" /iU', $v['message'], $urls);

Ideally, I would get an associative array like this:

[0] => Array
(
   [title] => XXX
   [link] => http://test.com/blop
)
[1] => Array
(
   [title] => XXX
   [link] => http://test.com
)

Thanks for your help

For the umptillionth time on this site, don't use regex to parse/handle HTML. Use DOM instead. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Marc B, Oct 24 '11 at 16:14
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Alex Turpin, Oct 24 '11 at 16:14
[See](http://stackoverflow.com/questions/7878604/regex-to-find-all-url-and-titles) — Emmanuel N, Oct 24 '11 at 16:15
@Rocket I meant [this](http://stackoverflow.com/questions/1449618/how-to-find-a-url-from-a-content-by-php) — Emmanuel N, Oct 24 '11 at 16:26

Marcus · Accepted Answer · 2014-02-04T10:04:13.297

3

If you still insist on using regex to solve this problem, you might be able to parse some of it with this regex:

<a.*?href="(.*?)".*?>(.*?)</a>

Note that it doesn't use the U modifier as yours did.
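
A minimal sketch of how this pattern could produce the array layout asked for in the question. The variable names `$html` and `$links`, the `#` delimiter (chosen so the slash in `</a>` needs no escaping), and the `i` and `s` flags are assumptions made for illustration, not part of the answer above.

```php
<?php
// Sample input taken from the question.
$html = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// PREG_SET_ORDER returns one entry per match:
// $m[1] is the href value, $m[2] is the text between the tags.
preg_match_all('#<a.*?href="(.*?)".*?>(.*?)</a>#is', $html, $matches, PREG_SET_ORDER);

// Rename the numeric capture groups to the keys used in the question.
$links = array();
foreach ($matches as $m) {
    $links[] = array('title' => $m[2], 'link' => $m[1]);
}

print_r($links);
```

This only holds for simple, well-formed anchors like the ones in the sample; nested tags or unusual attribute layouts are where a DOM parser becomes the safer choice.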

Update: To have it accept single quotes as well as double quotes, you can use the following pattern instead:

<a.*?href=(?:"(.*?)"|'(.*?)').*?>(.*?)</a>
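
With this alternation the href ends up in group 1 (double quotes) or group 2 (single quotes), and the link text moves to group 3. A short sketch of how the groups might be merged back together; the input string and variable names are hypothetical, and the `#` delimiter and `is` flags are assumptions.

```php
<?php
// Hypothetical input mixing both quoting styles.
$html = "<a href='http://a.example/x'>one</a> and <a href=\"http://b.example/y\">two</a>";

$pattern = '#<a.*?href=(?:"(.*?)"|\'(.*?)\').*?>(.*?)</a>#is';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $m) {
    // The group belonging to the branch that did not match is an empty string,
    // so fall back to group 2 when group 1 is empty.
    $href = ($m[1] !== '') ? $m[1] : $m[2];
    $links[] = array('title' => $m[3], 'link' => $href);
}

print_r($links);
```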

edited Feb 04 '14 at 10:04

answered Oct 24 '11 at 16:20

Marcus



`preg_match_all('#(.*?)#i', $v['message'], $matches);` – ghbarratt Oct 24 '11 at 17:02
This will fail when the href value is wrapped in single quotes instead of double quotes – Thompson Feb 04 '14 at 09:44
@MohanSinfh updated it with support for both. Though I still suggest that you should use a real DOM parser instead of regex – Marcus Feb 04 '14 at 10:05

score 3 · Answer 2 · answered Oct 24 '11 at 16:22

As has been mentioned in the comments, don't use a regular expression; use a DOM parser instead.
For example:

<?php
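// Parse the HTML into a DOM tree.
$doc = new DOMDocument;
$doc->loadHTML( getExampleData() );

// Select only the links inside the paragraph with id="abc".
$xpath = new DOMXPath($doc);
foreach( $xpath->query('/html/body/p[@id="abc"]//a') as $node ) {
    // Print each link as "href - text".
    echo $node->getAttribute('href'), ' - ' , $node->textContent, "\n";
}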

function getExampleData() {
    return '<html><head><title>...</title></head><body>
    <p>
        not <a href="wrong">this one</a> but ....
    </p>
    <p id="abc">
        Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.
    </p>
    </body></html>';
}

See http://docs.php.net/DOMDocument and http://docs.php.net/DOMXPath

score 2 · Answer 3 · answered Oct 24 '11 at 16:21

You shouldn't use RegEx for this. You should use an XML/DOM parser. I made this quickly using DOMDocument.

$links = array();
$dom = new DOMDocument;
// The @ silences any warnings loadHTML raises for malformed markup.
@$dom->loadHTML('Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.');
$xPath = new DOMXPath($dom);
// Select every <a> element in the document.
$a = $xPath->query('//a');
for($i=0; $i<$a->length; $i++){
    $e = $a->item($i);
    // Store the link text as the title and the href attribute as the link.
    $links[] = array(
        'title' => $e->nodeValue,
        'link' => $e->getAttribute('href')
    );
}
print_r($links);

DEMO: http://codepad.org/2LEn2CAJ

GlyphGryph · Answer 4 · 2011-10-24T17:35:20.687

1

preg_match_all("/<a[^>]*href=\"([^\"]*)[^>]*>([^<]*)<\/a>/", $v['message'], $urls, PREG_SET_ORDER);

should work to give you what you want. It's not an associative array, but it should be a nested array in the format you desire.
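
A brief sketch of what `PREG_SET_ORDER` changes compared with the default ordering, and how the nested result could be turned into the associative layout from the question. The variable names are illustrative assumptions. Note that with `/` as the delimiter, the slash in `</a>` has to be escaped.

```php
<?php
// Sample input taken from the question.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';

$pattern = '/<a[^>]*href="([^"]*)[^>]*>([^<]*)<\/a>/';

// Default (PREG_PATTERN_ORDER): one sub-array per capture group.
// $byGroup[1] holds every href, $byGroup[2] holds every link text.
preg_match_all($pattern, $message, $byGroup);

// PREG_SET_ORDER: one sub-array per match.
// $bySet[0][1] is the first href, $bySet[0][2] is the first link text.
preg_match_all($pattern, $message, $bySet, PREG_SET_ORDER);

// Renaming the numeric keys gives the associative layout from the question.
$links = array_map(function ($m) {
    return array('title' => $m[2], 'link' => $m[1]);
}, $bySet);

print_r($links);
```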

edited Oct 24 '11 at 17:35

answered Oct 24 '11 at 16:23

GlyphGryph


score 0 · Answer 5 · answered Oct 24 '11 at 16:24


To the people suggesting DOM: it might be nice to use DOM, but of course you will not use a FULL DOM parser just to parse a couple of urls/titles!

Just use this regex:

/<a.*href="([^" ]*)".*>(.*)<\/a>/iU
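
The `U` modifier makes every quantifier lazy by default, which is why the plain `.*` in this pattern does not run past the first `</a>`. As a rough illustration (variable names are assumptions), the sketch below compares it against an equivalent pattern without `U` that marks each quantifier lazy explicitly:

```php
<?php
// Sample input taken from the question.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// With U, quantifiers such as .* are lazy by default.
preg_match_all('/<a.*href="([^" ]*)".*>(.*)<\/a>/iU', $message, $withU);

// Without U, the same behaviour needs an explicit ? after each quantifier.
preg_match_all('/<a.*?href="([^" ]*?)".*?>(.*?)<\/a>/i', $message, $withoutU);

// Both calls should yield the same hrefs ([1]) and link texts ([2]).
var_dump($withU === $withoutU);
print_r($withU[1]);
print_r($withU[2]);
```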

answered Oct 24 '11 at 16:24

Yousf



Of course I will use a *full* DOM parser to parse a couple of urls/titles. That's what a DOM parser is for, parsing DOM. – gen_Eric Oct 25 '11 at 13:57
