6

I'd like one or more regexes that can:

1) Take the html of a large page.

2) Find the urls contained in all links, for example:

<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>

And so on. It should extract the URL contained in the 'href' attribute regardless of what comes before or after the href.

3) Extract the anchor text of all links. For the examples above, it should return 'http://example1.com' with the anchor text 'Test 1', then 'http://example2.com' with 'Test 2', and so on.

Ali
  • Any reason you don't want to use a DOM Parser for this? And any reason you couldn't find the duplicate? – Gordon Jan 07 '11 at 11:15
  • possible duplicate of [php regular expression to match specific url pattern](http://stackoverflow.com/questions/2532358/php-regular-expression-to-match-specific-url-pattern) – Gordon Jan 07 '11 at 11:16
  • possible duplicate of [Regular expression for grabbing the href attribute of an A element](http://stackoverflow.com/questions/3820666/regular-expression-for-grabbing-the-href-attribute-of-an-a-element/3820783#3820783) – Gordon Jan 07 '11 at 11:19
  • I love how this gets asked a million times every day – ySgPjx Jan 07 '11 at 11:19
  • possible duplicate of [scrape the data from html page php](http://stackoverflow.com/questions/3369373/scrape-the-data-from-html-page-php/3369474#3369474) – Gordon Jan 07 '11 at 11:22
  • I'm *sure* there's a post on SO about parsing HTML with regexes. Where was it now ... ? – Tim Barrass Jan 07 '11 at 11:36
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Jan 07 '11 at 11:43
  • if you want to do it with regex, have a look at this: http://www.martinwardener.com/regex/ – d7samurai Nov 13 '13 at 21:44

6 Answers

8
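<?php

$dom = new DOMDocument();
$dom->loadHTML($html);
$urls = $dom->getElementsByTagName('a');
foreach ($urls as $a) {
    // href attribute and anchor text of each link
    echo $a->getAttribute('href'), ' ', $a->textContent, "\n";
}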
Oliver O'Neill
  • A lot of people just throw out "Just use a DOM parser!" but none of them ever show a quick example of what it can do. http://php.net/manual/en/book.dom.php It does a lot more than my example. Worth learning about. – Oliver O'Neill Jan 12 '11 at 04:44
  • This answer is incomplete; here is one that works: http://stackoverflow.com/questions/4423272/how-to-extract-links-and-titles-from-a-html-page-but – giorgio79 Oct 28 '12 at 10:13
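A fuller sketch of the DOM approach from the answer above, run against the sample markup from the question. The helper name extractLinks and the returned array shape are assumptions made for illustration, not part of the original answer; the libxml error handling is added because real-world HTML is rarely valid.

```php
<?php
// Sample markup taken from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

// Hypothetical helper: returns a list of ['href' => ..., 'text' => ...] arrays.
function extractLinks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        // Skip anchors without an href (for example named anchors).
        if ($a->hasAttribute('href')) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
    }
    return $links;
}

$links = extractLinks($html);
foreach ($links as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Because the parser handles attribute order and quoting, the position of href inside the tag does not matter here.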
5
<?php
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] = link address
        // $match[3] = link text
    }
}
?>

This will extract both the link and the anchor text.
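As a usage sketch, the same pattern can be run against the sample markup from the question. The $links array and its keys are assumptions added for illustration; the pattern itself is copied unchanged from the snippet above.

```php
<?php
// Sample markup taken from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

// Pattern copied from the answer; the U modifier inverts greediness.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

$links = [];
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] is the link address, $match[3] the link text.
        $links[] = ['href' => $match[2], 'text' => $match[3]];
    }
}

print_r($links);
```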

jayzantel
5

You need to take a look at lookahead and lookbehind assertions.

<?php

$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

// $matches[1] holds the href values, $matches[2] the anchor texts.
if (preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches)) {
    /*** at least one link matched ***/
    echo 'Found a match';
    print_r($matches);
} else {
    /*** if no match is found ***/
    echo 'No match found';
}
?>
Sergi
  • And of course, the correct way to do this is with the DOM parser, but it's also possible with regex. – Sergi Jan 07 '11 at 13:39
  • See my comment below GameBit's solution. It applies to your Regex as well. – Gordon Jan 07 '11 at 16:30
  • No, it won't break if there are single quotes inside the attributes; just try it. In fact, if you use this regex #]*>([^<]*)|]*>([^<]*) |]*>([^<]*)#i or something like that and discard empty result sets afterwards, it won't even break if you use single quotes or no quotes at all. The only way to break it is to use < in the anchor text, as I cannot use lookbehind with unlimited characters (a PHP regex limitation) to check whether it marks the end of the link or is a single character inside the text – Sergi Jan 08 '11 at 20:58
2

Try something like this:

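//not tested
// Only matches links where href is the sole attribute; lazy quantifiers stop at the first closing quote and </a>.
$regex_pattern = "/<a href=\"(.*?)\">(.*?)<\/a>/";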
Diablo
2
/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis
RolandasR
  • This will break when the attribute value is enclosed in double quotes and contains single quotes. It will also break when quotes are omitted, which would be permissible for an href value like next_page.htm. See http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.2 – Gordon Jan 07 '11 at 12:18
  • This one is pretty robust (test it here http://www.martinwardener.com/regex): `\b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))` – d7samurai Nov 13 '13 at 21:49
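The two failure cases Gordon describes (single quotes inside a double-quoted value, and an unquoted value) can be covered by giving each quoting style its own alternative. The following is only a sketch: the pattern and the sample markup are assumptions written for illustration, not taken from any answer here, and the pattern would still be fooled by things like a > inside an attribute value or a data-href attribute.

```php
<?php
// Hypothetical sample markup covering the three quoting styles.
$html = <<<'HTML'
<a href="javascript:alert('hi')">A</a>
<a href=next_page.htm>B</a>
<a href='x.html'>C</a>
HTML;

// Group 1: double-quoted value, group 2: single-quoted value,
// group 3: unquoted value, group 4: anchor text.
$pattern = '~<a\s[^>]*?\bhref\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>(.*?)</a>~is';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Exactly one of the three value groups is non-empty for a non-empty href.
        $href = $m[1] !== '' ? $m[1] : ($m[2] !== '' ? $m[2] : $m[3]);
        $links[] = ['href' => $href, 'text' => $m[4]];
    }
}

print_r($links);
```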
0

As far as using RegEx to extract links from HTML goes, this one is pretty damn robust:

\b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))

Here's one that extracts all 'plain' text (i.e. content outside tags) from HTML documents:

(<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)

Test them both here: http://www.martinwardener.com/regex
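A sketch of how the plain-text pattern might be used from PHP, assuming PCRE accepts it as written (it relies only on named groups and a named backreference). The sample markup and the filtering step are assumptions added for illustration.

```php
<?php
// Hypothetical sample markup with a script block and a comment.
$html = '<p>Hello <b>world</b></p><script>var x = 1;</script><!-- note -->';

// Pattern copied from the answer; "~" is the delimiter because the pattern contains "/".
$pattern = '~(<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)~i';

preg_match_all($pattern, $html, $matches);

// The "text" group is empty wherever a tag, comment, or script/style block
// matched, so trim each entry and drop the empty ones.
$text = array_values(array_filter(array_map('trim', $matches['text']), 'strlen'));

print_r($text);
```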

d7samurai