regular expression to secure email addresses

Question

Let's say there are two standard HTML email links:

<a href="mailto:test@test.com">test@test.com</a>
<a href="mailto:test@test.com" nosecure>test@test.com</a>

In PHP, I want to match only the email link that does not have the nosecure attribute. Something like \<a\b(?![^>]*\bnosecure\b)[^>]*>[^<]*<\/a> does the trick so far.
But now I want one capture group for the value of the href attribute and one for the text inside the <a>...</a> element. The second group is easy:

\<a\b(?![^>]*\bnosecure\b)[^>]*>([^<]*)<\/a>

But how do I get the first group? There can be any number of other characters before or after the href attribute, and nosecure can also appear either before or after it.
How do I get a regex group for the address in href="mailto:<group>"? The attribute value may also be quoted with ' instead of ".

Test cases and my current attempt: https://regex101.com/r/RNEZO3/2

Thanks for any help :)
greetings

Yet another question about difficulties parsing XML/HTML with a regex...Ugh. — Ken White, Apr 05 '17 at 23:40

miken32 · Accepted Answer · 2017-09-13T17:10:36.620

Never use regular expressions to parse HTML. Always use a DOM parser! This is easier than you think; you just have to learn a bit of XPath to select on the attribute (or the lack of it) and to get at the text contents.

<?php
$html = <<< HTML
<div>
<a href="mailto:test@test.com">test@test.com</a>
<a href="mailto:test@test.com" nosecure>test@test.com</a>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

/* href attribute of every link without a nosecure attribute */
$result = $xpath->query("//a[not(@nosecure)]/@href");
foreach ($result as $node) {
    echo str_replace("mailto:", "", $node->value), PHP_EOL;
}

/* text content of the same links */
$result = $xpath->query("//a[not(@nosecure)]/text()");
foreach ($result as $node) {
    echo $node->textContent, PHP_EOL;
}
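
The two queries above give you the addresses and the link texts as two separate lists. If you need them paired per link (the two "groups" from the question), a minimal sketch is to select the `<a>` elements themselves and read both values from each one. The `starts-with(@href, "mailto:")` filter, the third sample link, and the `$links` array are my own additions for illustration.

```php
<?php
$html = <<<HTML
<div>
<a href="mailto:test@test.com">test@test.com</a>
<a href="mailto:test@test.com" nosecure>test@test.com</a>
<a class="x" href='mailto:other@example.com'>Write to us</a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the <a> elements without nosecure whose href is a mailto: link.
$query = '//a[not(@nosecure)][starts-with(@href, "mailto:")]';

$links = [];
foreach ($xpath->query($query) as $a) {
    $links[] = [
        // Strip the leading "mailto:" to get the bare address.
        'email' => substr($a->getAttribute('href'), strlen('mailto:')),
        // Text inside <a>...</a>.
        'text'  => $a->textContent,
    ];
}

print_r($links);
```

Attribute order and quote style (`'` or `"`) do not matter here, because the parser has already normalized them before the XPath query runs.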

edited Sep 13 '17 at 17:10

answered Apr 05 '17 at 23:47


OK, I had never heard of this before, but thanks a lot, I will take a look at it and try to learn it. One question right at the start, though: processing the results separately seems to be possible, but can I manipulate them directly in the original string with `str_replace`? So that I get a changed `$html` at the end of your example instead of independent outputs? – toddeTV Apr 06 '17 at 00:11

Yes you can. You can edit the contents of `$node` and then after you're done use `$dom->saveHTML()` to output the new document. – miken32 Apr 06 '17 at 00:28
To edit when `$node` is an attribute, use `$node->value`, when it's a text node, use `$node->textContent`. – miken32 Apr 06 '17 at 00:40
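
A rough sketch of what the two comments above describe: edit the attribute nodes through `$node->value`, edit the text nodes through `$node->textContent`, then call `$dom->saveHTML()` to get the changed markup back. The `str_rot13()` call is only a placeholder for whatever "securing" you actually want to do. The `LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD` flags are an assumption on my part to keep `loadHTML()` from wrapping the fragment in `<html><body>`; they expect a single root element such as the `<div>` here.

```php
<?php
$html = '<div>'
    . '<a href="mailto:test@test.com">test@test.com</a>'
    . '<a href="mailto:test@test.com" nosecure>test@test.com</a>'
    . '</div>';

$dom = new DOMDocument();
// Assumption: these flags stop libxml from adding a doctype and <html>/<body>.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Attribute nodes are edited through ->value.
// This assumes every selected href starts with "mailto:".
foreach ($xpath->query('//a[not(@nosecure)]/@href') as $node) {
    $address = substr($node->value, strlen('mailto:'));
    $node->value = 'mailto:' . str_rot13($address); // placeholder transformation
}

// Text nodes are edited through ->textContent.
foreach ($xpath->query('//a[not(@nosecure)]/text()') as $node) {
    $node->textContent = str_rot13($node->textContent);
}

// Serialize the modified document back into a string.
$html = $dom->saveHTML();
echo $html;
```

The link with `nosecure` is left untouched because neither query selects it.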
The idea with DomDocument seems better than using regular expressions. But the solution is not working for me yet, so I am still experimenting with the code and have not closed the Stack Overflow tab. Sorry for working so slowly. – toddeTV Apr 06 '17 at 15:40

No worries, if you run into any more problems, post a new question. You'll definitely get better results parsing the HTML properly, though there is a bit of a learning curve to XPath. And don't rely on regular expressions to validate those email addresses either. Use `filter_var()` instead. – miken32 Apr 06 '17 at 15:59
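
For the `filter_var()` suggestion, a minimal sketch looks like this. The sample addresses and the `$results` array are made up for illustration.

```php
<?php
$candidates = ['test@test.com', 'not-an-email', 'user@', 'first.last@example.org'];

$results = [];
foreach ($candidates as $candidate) {
    // filter_var() returns the filtered string on success and false on failure.
    $results[$candidate] = filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
}

foreach ($results as $address => $isValid) {
    echo $address, ' => ', $isValid ? 'valid' : 'invalid', PHP_EOL;
}
```

This only checks the syntax of the address; it says nothing about whether the mailbox exists.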
