Let's say there are two standard HTML email links:

<a href="mailto:test@test.com">test@test.com</a>
<a href="mailto:test@test.com" nosecure>test@test.com</a>

In PHP, I want to match only the email link that does not have the nosecure attribute. Something like \<a\b(?![^>]*\bnosecure\b)[^>]*>[^<]*<\/a> does the trick so far.
But now I want one capture group for the value of the href attribute and one for the text inside <a>...</a>. The second group is easy:

\<a\b(?![^>]*\bnosecure\b)[^>]*>([^<]*)<\/a>

But how do I get the first group? There can be any number of other characters before or after the href attribute, and nosecure can also appear before or after it.
How do I capture the value in href="mailto:<group>"? Also, the value may be quoted with ' instead of ".

Test cases and my current attempt: https://regex101.com/r/RNEZO3/2

Thanks for any help :)
greetings

toddeTV

1 Answer

Never use regular expressions to parse HTML. Always use a DOM parser! This is easier than you think: you just have to learn a bit of XPath to select elements by an attribute (or the lack of one) and to read the text contents.

<?php
$html = <<<HTML
<div>
<a href="mailto:test@test.com">test@test.com</a>
<a href="mailto:test@test.com" nosecure>test@test.com</a>
</div>
HTML;

// Parse the markup into a DOM tree and prepare an XPath context for queries.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

/* href attribute of every <a> that has no nosecure attribute */
$result = $xpath->query("//a[not(@nosecure)]/@href");
foreach ($result as $node) {
    // Strip the scheme so that only the address remains.
    echo str_replace("mailto:", "", $node->value), PHP_EOL;
}

/* text content of the same links */
$result = $xpath->query("//a[not(@nosecure)]/text()");
foreach ($result as $node) {
    echo $node->textContent, PHP_EOL;
}
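
Since the question asks for the address and the link text as a pair, a minimal sketch that queries the `<a>` elements themselves and reads both values from each one might look like this. The function name `extractMailtoLinks` and the shape of the returned array are illustrative assumptions, not something from the question.

```php
<?php
// Hypothetical helper: returns address/text pairs for every mailto link
// that does not carry the nosecure attribute.
function extractMailtoLinks(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings that libxml may raise on imperfect real-world markup.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $links = [];
    // The parser normalizes quoting, so href='...' and href="..." behave the same.
    foreach ($xpath->query('//a[not(@nosecure)][starts-with(@href, "mailto:")]') as $a) {
        $links[] = [
            'address' => substr($a->getAttribute('href'), strlen('mailto:')),
            'text'    => $a->textContent,
        ];
    }
    return $links;
}

$html = <<<HTML
<div>
<a class="x" href='mailto:test@test.com'>test@test.com</a>
<a href="mailto:other@test.com" nosecure>other@test.com</a>
</div>
HTML;

print_r(extractMailtoLinks($html));
```

Because each pair is read from the same element, the position of `href` relative to `nosecure` or to any other attribute does not matter.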
miken32
  • OK, I had never heard of this before, but thanks a lot. I will take a look at it and try to learn it. One question right at the start: processing the results externally seems possible, but can I manipulate them directly in the original string with `str_replace`, so that I end up with a changed `$html` at the end of your example instead of independent outputs? – toddeTV Apr 06 '17 at 00:11
  • Yes you can. You can edit the contents of `$node` and then after you're done use `$dom->saveHTML()` to output the new document. – miken32 Apr 06 '17 at 00:28
  • To edit when `$node` is an attribute, use `$node->value`, when it's a text node, use `$node->textContent`. – miken32 Apr 06 '17 at 00:40
  • The idea with DomDocument seems better than using regular expressions. But the solution is not working for me yet, so I am still experimenting with the code and have not closed the Stack Overflow tab yet. Sorry for being slow. – toddeTV Apr 06 '17 at 15:40
  • No worries, if you run into any more problems, post a new question. You'll definitely get better results parsing the HTML properly, though there is a bit of a learning curve to XPath. And don't rely on regular expressions to validate those email addresses either. Use `filter_var()` instead. – miken32 Apr 06 '17 at 15:59