Regex to get href value of links that do not have rel='nofollow'

Question

I have a string that contains HTML link tags, and I need to use PHP's preg_match_all to get the href value of each tag, but only if the tag does not have a rel='nofollow' attribute. I found the following expression, which gets the href value of all the links.

$regex= "/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU";

How can I modify it to only get the links I want? Here is what it should look like:

$string= "<a href='link1.php'>Link</a>";
$string.= "<a href='link2.php'>Link2</a>";
$string.= "<a href='link3.php' rel='nofollow'>Link3</a>";
$string.= "<a href='link4.php'>Link4</a>";

preg_match_all($regex, $string, $links);

So $links should be:

$links[0] => 'link1.php';
$links[1] => 'link2.php';
$links[2] => 'link4.php';

I need the expression to pick up links that use either single or double quotes. A bonus would be to pick up ill-formatted but still valid links. If it's not possible to get just the links I want, then a way to find the links I don't want and remove them from the array would also do. Note that the string is generated dynamically, may not have the same attribute order, and will contain other tags and characters besides the links.

`DOMDocument` is the right tool, not regular expressions. – revo Apr 27 '17 at 01:37

score 4 · Accepted Answer · edited May 23 '17 at 11:54

@revo is correct: this is not a job for regular expressions. Use a proper HTML parser to deconstruct the HTML, and then an XPath query to find the information you need.

$html = <<<HTML
<html>
<head>
<title>Example</title>
</head>
<body>
<a href='link1.php'>Link</a>
<a href="link's 2.php" class="link">Link2</a>
<a class="link" href='link3.php' rel='nofollow'>Link3</a>
<a href='link4.php'><span>Link4</span></a>
</body>
</html>
HTML;

$doc = new DOMDocument();
$valid = $doc->loadHTML($html);
$result = [];
if ($valid) {
  $xpath = new DOMXPath($doc);
  // find any <a> elements that do not have a rel="nofollow" attribute,
  // then pick up their href attribute
  $elements = $xpath->query("//a[not(@rel='nofollow')]/@href");
  // query() returns false (not null) when the expression is invalid
  if ($elements !== false) {
    foreach ($elements as $element) {
      $result[] = $element->nodeValue;
    }
  }
}
print_r($result);
# => Array
#    (
#        [0] => link1.php
#        [1] => link's 2.php
#        [2] => link4.php
#    )
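
One caveat with the query above: `@rel='nofollow'` compares the whole attribute value, so a link with `rel='noopener nofollow'` would still be returned. If rel should be treated as a space-separated token list (an assumption that goes beyond what the question asks), a possible variant pads the value with spaces and looks for the padded token. This is only a sketch: the sample links are made up, and the comparison stays case-sensitive, so `rel='NoFollow'` would slip through.

```php
<?php
// Sketch only: sample markup is hypothetical and rel is assumed to be a
// space-separated list of tokens.
$html = <<<HTML
<a href='link1.php'>Link</a>
<a href='link2.php' rel='noopener nofollow'>Link2</a>
<a href='link3.php' rel='nofollow'>Link3</a>
<a href='link4.php' rel='noopener'>Link4</a>
HTML;

$doc = new DOMDocument();
$result = [];
if ($doc->loadHTML($html)) {
    $xpath = new DOMXPath($doc);
    // Pad the normalized rel value with spaces so that ' nofollow ' only
    // matches a whole token; a missing rel becomes an empty string.
    $query = "//a[not(contains(concat(' ', normalize-space(@rel), ' '), ' nofollow '))]/@href";
    $elements = $xpath->query($query);
    if ($elements !== false) {
        foreach ($elements as $element) {
            $result[] = $element->nodeValue;
        }
    }
}
print_r($result);
# => Array
#    (
#        [0] => link1.php
#        [1] => link4.php
#    )
```

Links without any rel attribute are still included, because `normalize-space(@rel)` evaluates to an empty string for them.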

answered Apr 27 '17 at 02:13

Amadan


Alright, but I'm getting the error: "DOMDocument::loadHTML(): htmlParseStartTag: invalid element name in Entity, line: 1". I assume it's because I'm getting my HTML wrong. I'm using `file_get_contents("http://" . $root_path ,"r" )`, which worked for regex. – Zaper127 Apr 27 '17 at 02:38
It's also possibly because I'm using `charset="utf-8">`. I need a solution for when I don't know the charset of the loaded page. – Zaper127 Apr 27 '17 at 02:47
Ugh. It's because I use HTML5. Is there a solution that doesn't require the suppression of error codes? I would rather fix the problem instead of just skipping over it. – Zaper127 Apr 27 '17 at 03:03
Sorry, I can't say anything more specific without knowing what your HTML source is; even better would be if you could isolate a minimal HTML sample that still causes that error. – Amadan Apr 27 '17 at 05:18
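
On the question of handling parser complaints without hiding them: one option, sketched here under assumptions, is to have libxml collect its errors with `libxml_use_internal_errors(true)` and read them back with `libxml_get_errors()`, so they can be logged or inspected instead of surfacing as PHP warnings. The HTML5 fragment below is a made-up stand-in for the fetched page, and which messages (if any) are reported depends on the libxml2 version in use.

```php
<?php
// Hypothetical input standing in for the page returned by file_get_contents().
$html = "<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>t</title></head>"
      . "<body><nav><a href='link1.php'>Link</a>"
      . "<a href='link3.php' rel='nofollow'>Link3</a></nav></body></html>";

// Buffer libxml's parse problems instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);
$problems = libxml_get_errors();      // array of LibXMLError objects, may be empty
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

// Keep the messages around for inspection rather than discarding them.
$messages = [];
foreach ($problems as $problem) {
    $messages[] = sprintf('line %d: %s', $problem->line, trim($problem->message));
}

$result = [];
if ($loaded) {
    $xpath = new DOMXPath($doc);
    $elements = $xpath->query("//a[not(@rel='nofollow')]/@href");
    if ($elements !== false) {
        foreach ($elements as $element) {
            $result[] = $element->nodeValue;
        }
    }
}
print_r($messages);
print_r($result);
```

The document is still parsed and queried as before; the difference is that any complaints end up in `$messages`, where they can be reviewed to decide whether the markup actually needs fixing.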
