1

I have this regex pattern /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i that I use to get e-mail addresses from a string. But now I'd like to get only the e-mail addresses that are the value of an arbitrary HTML element attribute, together with the attribute itself. My example should make this clear:

<?php
$subject = 'abc dont@get.me 123 <input value="please@get.me">xyz';
$pattern = '/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i';
preg_match_all( $pattern, $subject, $matches );
var_dump( $matches );

will produce something like:

array(1) { [0]=> array(2) {
    [0]=> string(11) "dont@get.me"
    [1]=> string(13) "please@get.me"
} }

but I need:

array(1) { [0]=> array(1) {
    [0]=> string(21) "value="please@get.me""
} }

Please be aware that <input value="please@get.me"> is just an example. I need a pattern that can handle "all" HTML elements with "all" attributes (I put "all" in quotes because I'm aware there may be edge cases where the pattern fails, since HTML isn't regular) and:

<?php
$subject = "<br data-xyz=please@get.me /> dont@get.me <[tag] [attr]='[pre] andPlease@get.me [ap]'>";
preg_match_all( $pattern, $subject, $matches );
var_dump( $matches );

should produce something like:

array(1) { [0]=> array(2) {
    [0]=> string(22) "data-xyz=please@get.me"
    [1]=> string(36) "[attr]='[pre] andPlease@get.me [ap]'"
} }

To be honest I'm really bad at regex patterns so I don't have a clue about how to achieve it. Hope somebody can help me out with this!


EDIT: A solution other than regex would also be totally fine!

Axel
  • Depends on the context. For arbitrary HTML/SGML, a DOM node traversal and tedious attribute check would be most reliable. If it's about existing templates, an unanchored \w+=.+@… might suffice. – mario Feb 16 '19 at 13:55
  • You can use a capturing group to select all e-mails inside the input tag, or use lookarounds to ensure that only e-mails within an input tag are selected. – Pushpesh Kumar Rajwanshi Feb 16 '19 at 13:56
  • Dear @PushpeshKumarRajwanshi, that's what I'd like to do - but how? BTW: it must work with arbitrary HTML elements, not only input. – Axel Feb 16 '19 at 13:57
  • Using `DOMDocument` with `loadHTML()` and an XPath expression such as `//*[contains(@*, "@")]`, which looks for any element with an attribute containing an `@`, may be a start. – Nigel Ren Feb 16 '19 at 13:59
  • Dear @NigelRen, would you mind posting a regular answer with a working real-world example? That would be great! – Axel Feb 16 '19 at 14:04
  • @Axel: Can you check if my answer solves your purpose? – Pushpesh Kumar Rajwanshi Feb 16 '19 at 14:13
  • Grab all the inputs and check the value against `filter_var($email, FILTER_VALIDATE_EMAIL)` – Hayden Feb 16 '19 at 14:38
  • @Hayden - I've added that to my answer and it solves the problem of finding non-email attributes - thanks! – Nigel Ren Feb 16 '19 at 14:46

2 Answers

3

To do this with DOMDocument and XPath, first load the document as HTML, then use XPath to look for any attributes whose value contains an '@' symbol:

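// Sample input: e-mail addresses in plain text and in attribute values.
$subject = 'abc dont@get.me 123 <input value="please@get.me">
          <span t="please@get.me2" u="please@get.me3" />
           <span t="pleasedont get.me" />
        <span t="@@@@">xyz';

// Parse the string as HTML.
$doc = new DOMDocument();
$doc->loadHTML($subject);

// Select every attribute whose value contains an "@".
$xp = new DOMXPath($doc);
$possibilities = $xp->query('//*/@*[contains(., "@")]');

foreach ( $possibilities as $match )    {
    // Keep only values that validate as an e-mail address.
    if ( filter_var($match->nodeValue, FILTER_VALIDATE_EMAIL) ) {
        // Print the element name, the attribute name and the attribute value.
        echo $match->parentNode->nodeName." ".
            $match->nodeName."=". $match->nodeValue.PHP_EOL;
    }
}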

(Edit as suggested by Hayden in the comment - I've updated the answer to validate that it is an email address before printing out the values).

This will output:

input value=please@get.me
span t=please@get.me2
span u=please@get.me3

To break down the XPath:

//*/@*[contains(., "@")]

The //* selects any element, the /@* selects any attribute of that element, and the [] expression after it is a condition, so only attributes that match the condition are returned. The condition contains(., "@") says that the value of the attribute must contain an @. Put together, the expression selects every attribute of every element whose value contains an @. In the loop, $match->nodeValue gives the attribute value, $match->nodeName gives the attribute name, and $match->parentNode->nodeName gives the name of the element the attribute belongs to.

Also note that this method returns a separate match for each matching attribute, so one element can produce several matches (e.g. please@get.me3, which sits on the same span as please@get.me2).
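The same XPath query can also produce strings in the shape asked for in the question. What follows is a minimal sketch under a few assumptions, not a definitive implementation: findEmailAttributes() is a hypothetical helper name, it swaps filter_var() for the e-mail pattern from the question so that values with surrounding text (such as '[pre] andPlease@get.me [ap]') are kept as well, and it always re-quotes the value with double quotes, whatever quoting the source used.

```php
<?php
// Hypothetical helper (not part of the answer above): returns strings such as
// name="value" for every attribute whose value contains an e-mail address.
function findEmailAttributes(string $html): array
{
    // The e-mail pattern from the question, with "~" as the delimiter.
    $emailPattern = '~[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}~i';

    // Collect libxml's parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    $xpath = new DOMXPath($doc);
    // Same XPath as above: every attribute whose value contains an "@".
    foreach ($xpath->query('//*/@*[contains(., "@")]') as $attr) {
        // Keep the attribute if the question's pattern matches anywhere in it.
        if (preg_match($emailPattern, $attr->nodeValue) === 1) {
            // Assumption: the value is always re-quoted with double quotes.
            $found[] = $attr->nodeName . '="' . $attr->nodeValue . '"';
        }
    }
    return $found;
}

$subject = 'abc dont@get.me 123 <input value="please@get.me">xyz';
var_dump(findEmailAttributes($subject));
```

For the first example from the question this yields a single entry, value="please@get.me", and the address in the plain text is ignored. Because the pattern is unanchored, an attribute value that merely contains something address-like is kept in full, which may or may not be what you want.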

Nigel Ren
  • Dear Nigel Ren, thank you so much for taking the time. I will mark Pushpesh's answer as "accepted". But yours is also great (+1) and I'm not even sure which path I will take in the long run. Maybe it will even be yours. But as you know, my initial question was about regex. Thank you x 1000 anyway! – Axel Feb 16 '19 at 15:05
  • Obligatory reference - https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Nigel Ren Feb 16 '19 at 15:07
-1

You can use this regex to match any tag that contains an e-mail address in one of its attribute values, provided the tag name and attribute name consist of word (\w) characters:

<\w+.*?([\w-]+=["']*\s*(?:\w+\s*)*[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\s*(?:['"]?(?:\w+\s*)*['"]?)?["']*).*?>

Then read the value of the first capturing group.

This assumes that tag names and attribute names consist of \w characters. If they may contain further characters such as - or ., change \w to [\w.-] in the regex.

Demo
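As a rough sketch of how the first capturing group could be read in PHP: the helper name extractAttributesWithEmail() is hypothetical, and the "~" delimiter and the "i" flag are my assumptions (the flag because the character classes only list upper-case letters, as in the question's pattern). The trim() is there because, for unquoted values, the group can end in whitespace.

```php
<?php
// Hypothetical helper: returns the contents of capturing group 1 for each tag
// that the pattern matches.
function extractAttributesWithEmail(string $html): array
{
    // The regex from this answer. Assumptions: "~" as delimiter, "i" flag.
    $pattern = '~<\w+.*?([\w-]+=["\']*\s*(?:\w+\s*)*[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\s*(?:[\'"]?(?:\w+\s*)*[\'"]?)?["\']*).*?>~i';

    if (preg_match_all($pattern, $html, $matches) === false) {
        return []; // PCRE error, for example the backtrack limit was reached
    }

    // $matches[1] holds group 1 of every match; strip surrounding whitespace.
    return array_map('trim', $matches[1]);
}

$subject = 'abc dont@get.me 123 <input value="please@get.me">xyz';
var_dump(extractAttributesWithEmail($subject));
```

For the first example from the question this returns one entry, value="please@get.me". A tag such as <[tag] ...> from the second example is not matched, because the pattern requires a \w character directly after the <.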

Edit:

Another way: if you do not want to read the result from group 1 and instead want the full match to contain only the attribute name and e-mail, you can use the \K operator with this regex:

<\w+.*?\K[\w-]+=["']*\s*(?:\w+\s*)*[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\s*(?:['"]?(?:\w+\s*)*['"]?)?["']*(?=.*?>)

Demo with full match containing the text you want
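A comparable sketch for the \K variant, again with a hypothetical helper name (extractWithMatchReset()) and the assumed "~" delimiter and "i" flag. Here the wanted text is the full match, so it is read from $matches[0] instead of a group.

```php
<?php
// Hypothetical helper: returns the full matches of the \K variant.
function extractWithMatchReset(string $html): array
{
    // \K discards everything matched so far (the "<tag ..." prefix) from the
    // reported match; the lookahead only checks that a ">" follows.
    $pattern = '~<\w+.*?\K[\w-]+=["\']*\s*(?:\w+\s*)*[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\s*(?:[\'"]?(?:\w+\s*)*[\'"]?)?["\']*(?=.*?>)~i';

    if (preg_match_all($pattern, $html, $matches) === false) {
        return []; // PCRE error, for example the backtrack limit was reached
    }

    // $matches[0] holds the full matches; strip surrounding whitespace.
    return array_map('trim', $matches[0]);
}

$subject = 'abc dont@get.me 123 <input value="please@get.me">xyz';
var_dump(extractWithMatchReset($subject));
```

One thing to be aware of: as far as I can tell, each new match has to start at a fresh <, so a tag with several e-mail attributes appears to yield only the first of them.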

Pushpesh Kumar Rajwanshi