
I'm using

public function __construct()
{
    $this->EE =& get_instance();
    $regex = '/(\S+@\S+\.\S+)/';
    $replace = '<a href="mailto:$1">$1</a>';


    $this->return_data = preg_replace($regex, $replace, ee()->TMPL->tagdata);
}

to find plain-text email addresses and change them to mailto links. However, the WYSIWYG editor puts the closing paragraph tag right after the address, so the regex catches the closing tag and includes it in the mailto link. I need my regex to exclude anything after the .com or .net or whatever. How would I do this?

Right now, it's returning mailto:email@domain.com</p>. I need to exclude any and all tags that come after the .com.

Here is part of the dump; this is what's output:

<br />
Preston Newbill<br />
Manager<br />
pnewbill@domain.com</p>
nick

3 Answers

A very basic regex to grab an email address without matching any HTML tags would be:

[\w\.]+@[\w\.\-]+

Explanation below:

  • \w: stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits
  • \.: escaped dot
  • [\w\.]+: matches one or more characters, each of which is a word character or a dot

Unfortunately this doesn't match all possible email addresses. See this question for more details.
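
As a rough illustration, here is how the basic pattern above could be dropped into a preg_replace call. The function name linkify_emails is hypothetical and not part of the question's plugin; this is a minimal sketch, not a complete email matcher.

```php
<?php
// Hypothetical helper (not from the question): wraps whatever the basic
// pattern matches in a mailto link.
function linkify_emails(string $text): string
{
    // [\w.]+   : local part, one or more word characters or dots
    // [\w.\-]+ : domain part, one or more word characters, dots or hyphens
    // $0 in the replacement refers to the whole match.
    return preg_replace('/[\w.]+@[\w.\-]+/', '<a href="mailto:$0">$0</a>', $text);
}

// The match stops at "<" because "<" is not in either character class.
echo linkify_emails("Manager<br />\npnewbill@domain.com</p>");
```

Because the domain class also accepts dots, a full stop typed directly after an address would be pulled into the link, so treat this as a starting point only.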

A fully RFC‑822–compliant regex (source) would be:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
ukliviu
  • According to this [**StackOverflow Q&A**](http://stackoverflow.com/questions/7111881/what-are-the-allowed-characters-in-a-sub-domain) and [**RFC-1034**](http://tools.ietf.org/html/rfc1034), letters, numbers, and hyphens are all that are allowed in the subdomain/domain portion. So, it seems you could simply do `[A-Za-z0-9\.-]+` on the part following the `@`. +1 as you continue to improve your answer! – crush Sep 13 '13 at 15:26
  • According to [this answer](http://stackoverflow.com/questions/2180465/can-someone-have-a-subdomain-with-an-underscore-in-it) and [RFC-2181](http://www.ietf.org/rfc/rfc2181.txt), underscores are allowed as well. – ukliviu Sep 13 '13 at 15:36
  • Excellent, so that means you could do `[\w\.-]+` – crush Sep 13 '13 at 15:37
  • The portion before the `@` is the tough part. It can basically have any character if the characters are in between quotation marks. That's why you have a monstrous regex like above. It's usually a better idea to create a function for validating email addresses, rather than using a regular expression for this reason! Your general purpose expression should catch most simpler cases. – crush Sep 13 '13 at 15:43

You could try changing your regular expression to the following:

/(\S+@\S+\.[^\<]+)/

This would stop capturing at the first < it encounters after the top-level domain.

@ukliviu suggests a more restrictive pattern, which should produce fewer false positives than simply stopping at HTML tags.
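
A small sketch of how this pattern behaves on input like the one in the question. The function name linkify_until_tag and the second sample string are made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function linkify_until_tag(string $html): string
{
    // After the last dot, [^<]+ consumes everything up to the next "<".
    return preg_replace('/(\S+@\S+\.[^<]+)/', '<a href="mailto:$1">$1</a>', $html);
}

// The closing </p> is no longer swallowed by the link.
echo linkify_until_tag("Manager<br />\npnewbill@domain.com</p>"), "\n";

// Caveat: [^<]+ also matches spaces, so text that follows the address
// before the next tag ends up inside the link as well.
echo linkify_until_tag('pnewbill@domain.com or call us</p>'), "\n";
```

This works when the address is immediately followed by a tag, as in the question's dump, but the second call shows why a more restrictive character class after the final dot is worth considering.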

crush

Broadly speaking, it is a bad idea to mix HTML markup with regex. Your results will vary, with too much variation for a reliable script. If you need to parse HTML, use the HTML parser available right in PHP, DOMDocument.

To get RID of HTML is even simpler. You can use strip_tags to remove any and all HTML from the string, even broken markup. Your code could simply be:

$this->return_data = strip_tags(ee()->TMPL->tagdata);

Proof of concept:

$sample1 = 'mailto:email@domain.com</p>';
echo 'dirty: '.htmlentities($sample1).', clean: '.htmlentities(strip_tags($sample1));
// output: dirty: mailto:email@domain.com</p>, clean: mailto:email@domain.com 

See it in action here: http://codepad.viper-7.com/KHsIr0

One function call, no crazy regex to maintain.


Here is an example of how to do this with DomDocument:

// create a new DomDocument object
$doc = new DOMDocument();

// load the HTML into the DomDocument object (this would be your source HTML)
libxml_use_internal_errors(true);
$doc->loadHTML('
    <p>
        <br>
        Preston Newbill<br>
        Manager<br>
        pnewbill@domain.com<br>
        <a href="mailto:noob@aol.com">also email me @ noob@aol.com</a><br>
        Party 9/15/2013@10:00pm!
');
libxml_clear_errors();

// grab the body, recursively check for child nodes. Turn any email addresses into links
$body = $doc->getElementsByTagName('body')->item(0);
checkDomNodeForEmailAddress($body);

// strip off the html,head, and body
$doc->removeChild($doc->firstChild);            
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

die('<hr>final product:'.htmlentities($doc->saveHtml()));

function checkDomNodeForEmailAddress(DOMNode $domNode) {
    foreach ($domNode->childNodes as $node) {
        if($node->hasChildNodes()) {
            if (strtolower($node->nodeName) != 'a')
                checkDomNodeForEmailAddress($node);
        } else {
            $node->nodeValue = preg_replace('/(\S+@\S+\.[^\<]+)/', '<a href="mailto:$1">$1</a>', $node->nodeValue);
        }
    }    
}

Try it here: http://codepad.viper-7.com/EpdBKx
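
One thing to watch with the function above: assigning a string to nodeValue sets text content, so as far as I know the markup is escaped on output instead of becoming a real element. Below is a hedged sketch of a variant that builds actual a elements from text nodes. The function name linkify_text_nodes, the wrapper div, and the stricter address pattern (which requires a dot in the domain) are all assumptions for illustration, and character-encoding handling is left out.

```php
<?php
// Hypothetical helper: links addresses found in text nodes only, skipping
// text that already sits inside an <a> element.
function linkify_text_nodes(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment in a known container so it can be pulled out again.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);

    // Copy the matching text nodes into an array before editing the tree.
    $textNodes = iterator_to_array(
        $xpath->query('.//text()[not(ancestor::a)]', $wrap),
        false
    );

    // Assumed pattern: local part, "@", then a domain containing a dot.
    $pattern = '/([\w.]+@[\w\-]+(?:\.[\w\-]+)+)/';

    foreach ($textNodes as $textNode) {
        // With DELIM_CAPTURE, odd indexes hold addresses, even indexes plain text.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no address in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', 'mailto:' . $part);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper, not html/head/body.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkify_text_nodes("Preston Newbill<br />\nManager<br />\npnewbill@domain.com");
```

Because only text nodes are touched, surrounding tags cannot leak into the link, and a string such as "9/15/2013@10:00pm" is left alone by the assumed pattern since nothing after the @ contains a dot.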

Documentation

Chris Baker
  • That would also remove ALL tags that are put in by the editor, so it would be one long text string with no formatting. Would it not? – nick Sep 13 '13 at 15:46
  • @nick I had asked what `ee()->TMPL->tagdata` is. If it is the full template's HTML, that's an even greater reason not to use regex. Regex just isn't appropriate for parsing HTML. Yes, this would strip all tags out of whatever you passed it. If you need to deal with the whole editor's text, then you really should use DomDocument. RegEx seems easier on the surface, but you'll just keep finding little edge cases that break your pattern. – Chris Baker Sep 13 '13 at 15:47
  • @Chris is right. You should look into a DOM parser. You could then even do something like `document.getElementsByTagName("a");` which would quickly get you every `a` on the page, and you could easily extract the `href` attribute without needing to do any regular expressions or complicated checking. – crush Sep 13 '13 at 15:52
  • Here's a few trivial examples of realistic text that would get replaced erroneously by the regex in the accepted answer: http://codepad.viper-7.com/wxqOQO – Chris Baker Sep 13 '13 at 15:55
  • I don't understand why I would need a DOM parser. I'm taking plain text and changing it in to a mailto link. I have to maintain the editors tags or none of this would have been an issue. – nick Sep 13 '13 at 15:55
  • I've also changed the accepted answer, crush's answer works perfectly. – nick Sep 13 '13 at 15:56
  • @nick Hmm...I think there is too much we don't know about `ee()->TMPL->tagdata` to explain why a DOM parser might or might not be useful. It might not be useful in this case. I'm not sure what the value of `tagdata` is. – crush Sep 13 '13 at 16:00
  • @nick Regex is "dumb" (it isn't made to parse HTML), and HTML is not a regular language, which is what regex is good at parsing. With a DOM parser, you are working with what you have, which is markup. Consider something like this: http://codepad.viper-7.com/5qXnjR -- oops! – Chris Baker Sep 13 '13 at 16:01
  • Agreed, nick, can you show us what `tagData` is, like a vardump? If there's HTML in the string, it is not "plain text" – Chris Baker Sep 13 '13 at 16:01
  • I'm only catching the plain text that contains an @ and a . to grab all text that may be an email address and then it converts it in to mailto link without the user having to do anything. So a DOM parser wouldn't work until after the links have been created. give me a sec I'll dump the tag and paste it here. But yes, it contains strings. – nick Sep 13 '13 at 16:05
  • @nick If you wish, please update your answer with a `var_dump(ee()->TMPL->tagdata);` so we may offer a more comprehensive, and better, answer! – crush Sep 13 '13 at 16:16
  • @nick I think you're mixing up some jargon here. Any combination of spaces, characters (with or without letters), and numbers is a string. HTML markup is a string, an email address is a string, `abc123` is a string. "Plain text" refers to string data that is usable as-is -- it should not contain code, markup, or other items that need to be parsed or interpreted. If your string has HTML in it, it is not plain text, it is HTML. Please, do post it :) – Chris Baker Sep 13 '13 at 16:22
  • I've posted a small section of the dump, it shows you what it's outputting. – nick Sep 13 '13 at 16:27
  • @nick Here's an example of how you do it with DomDocument: http://codepad.viper-7.com/EpdBKx – Chris Baker Sep 13 '13 at 17:20