Regexp for extracting all links and anchor texts from HTML

Question

I'd like one or more regexes that can:

1) Take the html of a large page.

2) Find the urls contained in all links, for example:

<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>

And so on. It should extract the URL contained in the 'href' attribute regardless of what comes before or after the href.

3) Extract the anchor text of all links. For the examples above, it should return 'http://example1.com' with the anchor text 'Test 1', then 'http://example2.com' with 'Test 2', and so on.

Any reason you don't want to use a DOM parser for this? And any reason you couldn't find the duplicate? – Gordon, Jan 07 '11 at 11:15
Possible duplicate of [php regular expression to match specific url pattern](http://stackoverflow.com/questions/2532358/php-regular-expression-to-match-specific-url-pattern) – Gordon, Jan 07 '11 at 11:16
Possible duplicate of [Regular expression for grabbing the href attribute of an A element](http://stackoverflow.com/questions/3820666/regular-expression-for-grabbing-the-href-attribute-of-an-a-element/3820783#3820783) – Gordon, Jan 07 '11 at 11:19
Possible duplicate of [scrape the data from html page php](http://stackoverflow.com/questions/3369373/scrape-the-data-from-html-page-php/3369474#3369474) – Gordon, Jan 07 '11 at 11:22
I'm *sure* there's a post on SO about parsing HTML with regexes. Where was it now ... ? – Tim Barrass, Jan 07 '11 at 11:36
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon, Jan 07 '11 at 11:43
If you want to do it with regex, have a look at this: http://www.martinwardener.com/regex/ – d7samurai, Nov 13 '13 at 21:44

score 8 · Accepted Answer · answered Jan 07 '11 at 15:20

<?php

$dom = new DOMDocument();
$dom->loadHTML($html);
// Each item is a DOMElement: getAttribute('href') gives the URL, textContent the anchor text.
$urls = $dom->getElementsByTagName('a');

Oliver O'Neill

A lot of people just throw out "Just use a DOM parser!" but no one ever shows a quick example of what it can do. http://php.net/manual/en/book.dom.php It does a lot more than my example. Worth learning about. – Oliver O'Neill Jan 12 '11 at 04:44
This answer is incomplete, here is one that works http://stackoverflow.com/questions/4423272/how-to-extract-links-and-titles-from-a-html-page-but – giorgio79 Oct 28 '12 at 10:13
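
Since the accepted answer stops at collecting the `<a>` elements, here is a minimal sketch of how the remaining step could look. The function name `extractLinks` and the sample input (taken from the question) are illustrative assumptions, not part of the original answer; the DOM calls themselves (`loadHTML`, `getElementsByTagName`, `getAttribute`, `textContent`) are standard PHP DOM extension APIs.

```php
<?php
// Hypothetical helper (not from the original answer): returns [url, anchor text] pairs.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        // Skip named anchors that carry no href attribute.
        if ($a->hasAttribute('href')) {
            $links[] = [$a->getAttribute('href'), trim($a->textContent)];
        }
    }
    return $links;
}

$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

foreach (extractLinks($html) as [$url, $text]) {
    echo $url, ' => ', $text, "\n";
}
```

Because the parser handles quoting and attribute order itself, none of the quote-related edge cases discussed in the regex answers apply here.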

score 5 · Answer 2 · answered Dec 09 '13 at 12:26

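<?php
// The U flag inverts greediness, so (.*) stops at the first closing </a>.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $url  = $match[2]; // link address
        $text = $match[3]; // link text
        echo $url, ' => ', $text, "\n";
    }
}
?>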

This will extract both the link and the anchor text.
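
As a rough illustration of how this pattern behaves, the sketch below runs it on two links from the question plus one single-quoted link. The third sample link and the `trim()` call are assumptions added here: the pattern only recognises an optional double quote, so a single-quoted value is captured with its quotes still attached and needs stripping afterwards.

```php
<?php
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a href=\'page.htm\'>Single quoted</a>';

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

$links = [];
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // For the single-quoted link, $match[2] is 'page.htm' including the quotes,
        // so strip any surrounding quote characters before storing the URL.
        $links[] = [trim($match[2], "'\""), $match[3]];
    }
}
print_r($links);
```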

jayzantel

I use this one because it takes only 54 ms for a 4 MB file instead of 10-30 seconds with real parsers :) – KoalaBear Jan 24 '17 at 21:56
Really great work: just one regex and all the work is done. Learnt a new way today. – kanudo Mar 26 '17 at 14:19

score 5 · Answer 3 · answered Jan 07 '11 at 13:36

You need to take a look at lookahead and lookbehind assertions.

<?php

$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

// The lookahead (?=href="...") captures the URL into group 1 without consuming it;
// group 2 is the anchor text.
if (preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches)) {
    /*** at least one link was found ***/
    echo 'Found a match';
    print_r($matches);
} else {
    /*** if no match is found ***/
    echo 'No match found';
}
?>
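
For readers unfamiliar with the assertions used in the pattern above, here is a small standalone sketch of what they do. The test strings and variable names are made up for illustration. Note that a capturing group placed inside a lookahead still captures, which is how the pattern above obtains the URL.

```php
<?php
// (?=bar) asserts that "bar" follows, without making it part of the match.
preg_match('/foo(?=bar)/', 'foobar', $ahead);

// (?<=foo) asserts that "foo" precedes; PCRE does not allow lookbehinds of unlimited length.
preg_match('/(?<=foo)bar/', 'foobar', $behind);

// A group inside a lookahead is captured even though the text is not consumed.
preg_match('/a(?=(b+))/', 'abbb', $captured);

echo $ahead[0], ' ', $behind[0], ' ', $captured[0], ' ', $captured[1], "\n";
// prints "foo bar a bbb"
```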

Sergi

And of course, the correct way to do this is with the DOM parser, but it's also possible with regex. – Sergi Jan 07 '11 at 13:39
See my comment below GameBit's solution. It applies to your Regex as well. – Gordon Jan 07 '11 at 16:30
No, it won't break if there are single quotes inside the attributes; just try it. In fact, if you use this regex #]*>([^<]*)|]*>([^<]*) |]*>([^<]*)#i or something like that and discard empty result sets afterwards, it won't break even if you use single quotes or no quotes at all. The only way to break it is to use < in the anchor text, as I cannot use a lookbehind of unlimited length (a PHP regex limitation) to check whether it marks the end of the link or is a single character inside the text. – Sergi Jan 08 '11 at 20:58

score 2 · Answer 4 · answered Jan 07 '11 at 11:17

Try something like this:

//not tested
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

Diablo

This wouldn't match the second and third links in the OP's example markup. – Gordon Jan 07 '11 at 12:20

score 2 · Answer 5 · answered Jan 07 '11 at 11:18

/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis

RolandasR

This will break when the attribute value is enclosed in double quotes and contains single quotes. It will also break when quotes are omitted, which would be permissible for an href value like next_page.htm. See http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.2 – Gordon Jan 07 '11 at 12:18
this one is pretty robust (test it here http://www.martinwardener.com/regex): `\b(((src|href|action|url) *(=|:) *(?"|'|))(?[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k|url *$ *(?"|'|)(?[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k$)` – d7samurai Nov 13 '13 at 21:49
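
The failure cases Gordon describes for this answer (a double-quoted value containing single quotes, and an unquoted value) can be handled by matching each quoting style as its own alternative. The sketch below is one possible variant, not the answerer's pattern: the pattern, the sample markup, and the variable names are assumptions for illustration. It relies on PCRE's branch-reset group `(?|...)`, so the URL always lands in group 1 whichever alternative matched.

```php
<?php
// Sample markup (made up): double-quoted with an apostrophe, single-quoted with
// double quotes inside, and an unquoted value.
$html = <<<'HTML'
<a href="it's.htm">A</a>
<a href='say "hi".htm'>B</a>
<a class=x href=next_page.htm>C</a>
HTML;

// (?| ... ) is a branch-reset group: each alternative's capture is numbered 1.
$pattern = '~<a\s+(?:[^>]*?\s)?href\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>(.*?)</a>~is';

$links = [];
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    $links[] = [$m[1], $m[2]]; // [url, anchor text]
}
print_r($links);
```

This still assumes no attribute value contains a `>` character and that links are not nested, which is the kind of case where a DOM parser remains the safer choice.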

score 0 · Answer 6 · answered Nov 13 '13 at 21:55

As far as using RegEx to extract links from HTML goes, this one is pretty damn robust:

Here's one that extracts all 'plain' text (i.e. content outside tags) from HTML documents:

(<(?<tag>script|style)[\s\S]*?</\k<tag>>)||<[\s\S]*?>|(?<text>[^<>]*)

Test them both here: http://www.martinwardener.com/regex
