120

Trying to find the links on a page.

My regex is:

/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/

but it seems to fail on:

<a title="this" href="that">what?</a>

How would I change my regex to deal with href not placed first in the a tag?

bergin

10 Answers

220

Reliable regexes for HTML are difficult to write. Here is how to do it with DOM:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHtml($node), PHP_EOL;
}

The above would find and output the "outerHTML" of all A elements in the $html string.

To get all the text values of the node, you do

echo $node->nodeValue; 

To check if the href attribute exists you can do

echo $node->hasAttribute( 'href' );

To get the href attribute you'd do

echo $node->getAttribute( 'href' );

To change the href attribute you'd do

$node->setAttribute('href', 'something else');

To remove the href attribute you'd do

$node->removeAttribute('href'); 

You can also query for the href attribute directly with XPath

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
    echo $href->nodeValue;                       // echo current attribute value
    $href->nodeValue = 'new value';              // set new attribute value
    $href->parentNode->removeAttribute('href');  // remove attribute
}
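
Putting the pieces above together, here is a minimal sketch that collects the href and text of every link into an array. The helper name extract_links() and the array keys are my own assumptions for illustration, not part of the DOM API; libxml_use_internal_errors() is only there to keep warnings about sloppy markup quiet.

```php
<?php
// Hypothetical helper (name and return shape are assumptions):
// collect the href and text of every <a> element that has an href.
function extract_links(string $html): array
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        if (!$node->hasAttribute('href')) {
            continue; // skip named anchors such as <a name="top">
        }
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => $node->nodeValue,
        ];
    }
    return $links;
}

print_r(extract_links('<a title="this" href="that">what?</a>'));
```

The position of href inside the tag does not matter here, because the parser has already split the tag into attributes.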

On a side note: I am sure this is a duplicate and you can find the answer somewhere in here.

Gordon
  • A reliable regex for parsing HTML is inherently impossible, since HTML is not a regular language. – Asciiom Oct 10 '13 at 14:11
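
A small illustration of the point in the comment above, as a sketch only. The input string is made up for this example: it puts a ">" inside a quoted attribute, which trips up any pattern that uses [^>]* to stay inside the tag, while the DOM parser reads the attribute value as a whole.

```php
<?php
// Illustrative input (an assumption, not from the thread):
// a ">" inside a quoted attribute value.
$html = '<a title="a > b" href="that">what?</a>';

// A regex that relies on [^>]* to stay inside the tag stops at the quoted ">".
$found = preg_match('~<a\s[^>]*href=(["\'])(.*?)\1~', $html, $m);

// The DOM parser treats the quoted attribute value as one unit.
$dom = new DOMDocument;
$dom->loadHTML($html);
$href = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');

var_dump($found); // number of regex matches
var_dump($href);  // href as seen by the parser
```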
19

I agree with Gordon: you MUST use an HTML parser to parse HTML. But if you really want a regex, you can try this one:

/^<a.*?href=(["\'])(.*?)\1.*$/

This matches <a at the beginning of the string, followed by any number of any characters (non-greedy, .*?), then href= followed by the link surrounded by either " or '.

$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);

Output:

array(3) {
  [0]=>
  string(37) "<a title="this" href="that">what?</a>"
  [1]=>
  string(1) """
  [2]=>
  string(4) "that"
}
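
Because of the ^ and $ anchors, the pattern above only matches when the whole string is a single link. A hedged variation, not the pattern from this answer: drop the anchors, use ~ as the delimiter and call preg_match_all() to pick up several links in a longer string. The sample HTML is made up for illustration, and this still inherits the usual limits of regex on HTML (for instance, it would also match an attribute whose name merely ends in href).

```php
<?php
// Sample input is made up for illustration.
$html = '<p><a title="this" href="that">what?</a> and <a href=\'other\'>more</a></p>';

// No ^ or $ anchors, so the pattern can match anywhere in the string;
// [^>]*? keeps the search for href inside the opening tag.
preg_match_all('~<a\s[^>]*?href=(["\'])(.*?)\1~i', $html, $matches);

print_r($matches[2]); // the captured href values
```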
Toto
4

The pattern you want to look for would be the link anchor pattern, something like:

$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
Alex Pliutau
3

Why don't you just match:

"<a.*?href\s*=\s*['"](.*?)['"]"

<?php

$str = '<a title="this" href="that">what?</a>';

$res = array();

preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);

var_dump($res);

?>

then

$ php test.php
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(27) "<a title="this" href="that""
  }
  [1]=>
  array(1) {
    [0]=>
    string(4) "that"
  }
}

This works. I've just removed the first capture group.

Aif
3

For anyone who still hasn't found a solution, this is very easy and fast using SimpleXML:

$a = new SimpleXMLElement('<a href="www.something.com">Click here</a>');
echo $a['href']; // will echo www.something.com

It's working for me.
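
One caveat worth sketching: SimpleXMLElement expects well-formed XML, so it throws on markup that a browser would happily accept. Below is a minimal wrapper under that assumption. The function name href_from_fragment() is hypothetical, and the second input (an unquoted attribute value) is made up to show the failure case.

```php
<?php
// Hypothetical helper: returns the href of a single well-formed <a> fragment,
// or null when the fragment is not valid XML or has no href.
function href_from_fragment(string $fragment): ?string
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    try {
        $a = new SimpleXMLElement($fragment);
    } catch (Exception $e) {
        return null; // not well-formed XML
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
    return isset($a['href']) ? (string) $a['href'] : null;
}

var_dump(href_from_fragment('<a href="www.something.com">Click here</a>'));
var_dump(href_from_fragment('<a href=www.something.com>Click here</a>'));
```

For whole pages, which are rarely well-formed XML, DOMDocument::loadHTML() is probably the safer choice.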

Milan Malani
2

Quick test: <a\s+[^>]*href=(\"\'??)([^\1]+)(?:\1)>(.*)<\/a> seems to do the trick, with the 1st match being " or ', the second the 'href' value 'that', and the third the 'what?'.

The reason I left the first match of "/' in there is that you can use it to backreference it later for the closing "/' so it's the same.

See live example on: http://www.rubular.com/r/jsKyK2b6do

CharlesLeaf
  • @bergin please specify: what doesn't work? I get the exact value from the href in your test HTML. What are you expecting that this doesn't do? I see you use a different site for testing; there I also get the 'href' value successfully from your example. http://www.myregextester.com/?r=d966dd6b – CharlesLeaf Sep 29 '10 at 10:30
2

I'm not sure what you're trying to do here, but if you're trying to validate the link then look at PHP's filter_var()
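
A minimal sketch of what that could look like, assuming "validate" means checking for an absolute URL. The two sample values are made up; note that a relative href such as the "that" from the question does not pass FILTER_VALIDATE_URL, because it has no scheme.

```php
<?php
// filter_var() returns the value when it passes the filter and false otherwise.
$candidates = ['http://www.something.com/page', 'that'];

foreach ($candidates as $href) {
    $ok = filter_var($href, FILTER_VALIDATE_URL) !== false;
    echo $href, ' => ', $ok ? 'valid URL' : 'not an absolute URL', PHP_EOL;
}
```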

If you really need to use a regular expression then check out this tool, it may help: http://regex.larsolavtorvik.com/

Adam
2

Using your regex, I modified it a bit to suit your need.

<a.*?href=("|')(.*?)("|').*?>(.*)<\/a>

Personally, I suggest you use an HTML parser.

EDIT: Tested

Ruel
  • Using myregextester.com - sorry, it doesn't find the links. – bergin Sep 29 '10 at 10:28
  • it says: NO MATCHES. CHECK FOR DELIMITER COLLISION. – bergin Sep 29 '10 at 10:38
  • Can you please tell me the text to match? I use: `what?` – Ruel Sep 29 '10 at 10:41
  • My guess regarding this misunderstanding is that bergin didn't add pattern delimiters to Ruel's answer, which does not use pattern delimiters. Without pattern delimiters, the regex engine will assume `<` is the starting delimiter and `>` is the ending delimiter (of course those characters appear in the pattern, so you have "collisions"). – mickmackusa Dec 11 '20 at 06:45
0

The following is working for me and returns both href and value of the anchor tag.

preg_match_all("'\<a.*?href=\"(.*?)\".*?\>(.*?)\<\/a\>'si", $html, $match);
if($match) {
    foreach($match[0] as $k => $e) {
        $urls[] = array(
            'anchor'    =>  $e,
            'href'      =>  $match[1][$k],
            'value'     =>  $match[2][$k]
        );
    }
}

The multidimensional array called $urls now contains associative sub-arrays that are easy to use.

Meloman
  • I find single quotes to be a suboptimal choice for pattern delimiters -- they are so often used for actual quoting of strings that my eye didn't immediately register them as the delimiter. The most common delimiter is probably `/`, but since your pattern used `/`, I might recommend `~`. Because the delimiters are not `/`, you don't need to escape the `/` in your pattern. You also don't need to escape `<` or `>` because they have no special meaning to the regex engine. – mickmackusa Dec 11 '20 at 06:50
  • like this `"\(.*?)\si"` @mickmackusa ? – Meloman Dec 11 '20 at 09:31
  • No. You mustn't use backslashes as delimiters. Go for forward slashes. – mickmackusa Dec 11 '20 at 09:37
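
For reference, here is one way the suggestion in these comments might look when applied to the pattern from the answer, as a sketch: `~` as the delimiter and no backslashes in front of `<`, `>` or `/`. The sample input is made up for illustration.

```php
<?php
// Sample input is made up for illustration.
$html = '<p><a title="this" href="that">what?</a></p>';

// "~" delimiters: "/", "<" and ">" need no escaping inside the pattern.
preg_match_all('~<a.*?href="(.*?)".*?>(.*?)</a>~si', $html, $match);

print_r($match[1]); // href values
print_r($match[2]); // link texts
```

The s and i modifiers are kept from the original answer, so the pattern still spans newlines and ignores case.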
-1

preg_match_all("/(<a[^>]*>)(.*?)(<\/a)/", $contents, $impmatches, PREG_SET_ORDER);

It is tested and fetches all a tags from any HTML code.

Ravi Prakash