Remove closing paragraph tag after email address with regex

Question

I'm using

public function __construct()
{
    $this->EE =& get_instance();
    $regex = '/(\S+@\S+\.\S+)/';
    $replace = '<a href="mailto:$1">$1</a>';


    $this->return_data = preg_replace($regex, $replace, ee()->TMPL->tagdata);
}

to find plain-text email addresses and change them to mailto links. However, the WYSIWYG editor puts the closing paragraph tag right after the address, so the regex catches the closing tag and puts it into the mailto link. I need my regex to exclude anything after the .com or .net or whatever. How would I do this?

Right now, it's returning mailto:email@domain.com</p>. I need to exclude any and all tags that come after the .com.

Here is part of the dump. This is what's output:

<br />
Preston Newbill<br />
Manager<br />
pnewbill@domain.com</p>

So don't use a WYSIWYG, hard code it. I don't use those myself. — Funk Forty Niner, Sep 13 '13 at 15:09
Maybe you could show exactly what's coming in to the regex. Is it `email@domain.com` — Andy Gee, Sep 13 '13 at 15:09
No, that took care of it. /(\S+@\S+\.[^\<]+)/ you're my new hero — nick, Sep 13 '13 at 15:15
The `[^\<]+` says capture 1 or more characters excluding `<`. When it encounters the `<` it will stop capturing because the input will no longer match the pattern. — crush, Sep 13 '13 at 15:19
@PeterAlfvin It would've but it was too restrictive. Domains can have many characters in them besides alpha-numeric. So, now, I just look for an opening tag, and quit capturing there. — crush, Sep 13 '13 at 15:22
@crush `\S` doesn't match all the legal characters in a domain name? — Peter Alfvin, Sep 13 '13 at 15:23
@PeterAlfvin it matches anything but a whitespace. I think I should revise the regex further. — crush, Sep 13 '13 at 15:24
What exactly is `ee()->TMPL->tagdata`? Is it a string of HTML? — Chris Baker, Sep 13 '13 at 15:34

score 3 · Answer 1 · edited May 23 '17 at 12:28

A very basic regex to grab an email address without matching any HTML tags would be:

[\w\.]+@[\w\.\-]+

Explanation below:

\w: stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\.: escaped dot
[\w\.]+: matches one or more word characters or dots (the part before the @)
[\w\.\-]+: the same, but also allows hyphens (the part after the @)
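
As a minimal sketch of how this pattern could be dropped into the preg_replace call from the question: the sample string below is a shortened version of the dump in the question, and `$0` is used in the replacement because this pattern has no capture group.

```php
<?php
// Sample input: shortened from the dump in the question.
$tagdata = "Manager<br />\npnewbill@domain.com</p>";

// The basic pattern from this answer, wrapped in delimiters.
$regex = '/[\w\.]+@[\w\.\-]+/';
// $0 is the whole match, because the pattern has no capture group.
$replace = '<a href="mailto:$0">$0</a>';

$linked = preg_replace($regex, $replace, $tagdata);
echo $linked;
```

Because neither character class includes `<`, the match ends before the closing tag, which stays outside the link.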

Unfortunately this doesn't match all possible email addresses. See this question for more details.

A fully RFC‑822–compliant regex (source) would be:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

According to this [**StackOverflow Q&A**](http://stackoverflow.com/questions/7111881/what-are-the-allowed-characters-in-a-sub-domain) and [**RFC-1034**](http://tools.ietf.org/html/rfc1034) letters, numbers, and hyphens are all that are allowed in subdomains/domains portion. So, it seems you could simply do `[A-Za-z0-9\.-]+` on the part following the `@`. +1 as you continue to improve your answer! — crush, Sep 13 '13 at 15:26
According to [this answer](http://stackoverflow.com/questions/2180465/can-someone-have-a-subdomain-with-an-underscore-in-it) and [RFC-2181](http://www.ietf.org/rfc/rfc2181.txt), underscores are allowed as well. — ukliviu, Sep 13 '13 at 15:36
The portion before the `@` is the tough part. It can basically have any character if the characters are in between quotation marks. That's why you have a monstrous regex like above. It's usually a better idea to create a function for validating email addresses, rather than using a regular expression for this reason! Your general purpose expression should catch most simpler cases. — crush, Sep 13 '13 at 15:43
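
The last comment suggests validating with a function rather than a single regular expression. One way that might look in PHP is to find loose candidates with a simple pattern and let the built-in filter_var with FILTER_VALIDATE_EMAIL decide which ones become links. This is only a sketch: the helper name `linkifyValidEmails` and the candidate pattern are assumptions, not something from the thread.

```php
<?php
// Hypothetical helper: link only the candidates that PHP's own validator accepts.
function linkifyValidEmails(string $text): string
{
    // Loose candidate pattern: runs of non-space, non-angle-bracket characters around an @.
    return preg_replace_callback('/[^\s<>]+@[^\s<>]+/', function (array $m): string {
        $candidate = $m[0];
        if (filter_var($candidate, FILTER_VALIDATE_EMAIL) === false) {
            return $candidate; // not a valid address, leave the text alone
        }
        // Escape before placing the address inside markup.
        $safe = htmlspecialchars($candidate, ENT_QUOTES);
        return '<a href="mailto:' . $safe . '">' . $safe . '</a>';
    }, $text);
}

echo linkifyValidEmails("pnewbill@domain.com</p>"), "\n";
echo linkifyValidEmails("Party 9/15/2013@10:00pm!"), "\n";
```

The first call links the address and leaves the closing tag outside the link; the second is returned unchanged because the candidate fails validation.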

crush · Accepted Answer · 2013-09-13T15:28:05.047

1

You could try changing your regular expression to the following:

/(\S+@\S+\.[^\<]+)/

This stops capturing at the first < after the top-level domain.
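
A small sketch comparing the pattern from the question with the revised one may help show the difference. The sample strings are made up for illustration, and the last case shows a limitation worth knowing about.

```php
<?php
$original = '/(\S+@\S+\.\S+)/';     // pattern from the question
$revised  = '/(\S+@\S+\.[^\<]+)/';  // pattern from this answer

// \S+ also matches "<", "/" and ">", so the closing tag is swallowed.
preg_match($original, "pnewbill@domain.com</p>", $m);
$withTag = $m[1];

// [^\<]+ stops at the first "<".
preg_match($revised, "pnewbill@domain.com</p>", $m);
$clean = $m[1];

// Caveat: [^\<]+ also matches spaces, so trailing words are captured too.
preg_match($revised, "pnewbill@domain.com today</p>", $m);
$tooMuch = $m[1];

echo $withTag, "\n", $clean, "\n", $tooMuch, "\n";
```

The revised pattern fixes the case in the question, where the tag follows the address directly, but it keeps capturing any text between the address and the next tag.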

@ukliviu suggests a more restrictive approach that should produce fewer false positives, since it excludes more than just HTML tags.

edited Sep 13 '13 at 15:28

answered Sep 13 '13 at 15:16


Chris Baker · Answer 3 · 2013-09-13T17:22:19.233

1

Broadly speaking, it is a bad idea to mix HTML markup with regex. Your results will vary -- too much variation for a reliable script. If you need to parse HTML, use the HTML parser available right in PHP, DomDocument.

Getting rid of HTML is even simpler. You can use strip_tags to remove any and all HTML from the string, even broken markup. Your code could simply be:

$this->return_data = strip_tags(ee()->TMPL->tagdata);

Proof of concept:

$sample1 = 'mailto:email@domain.com</p>';
echo 'dirty: '.htmlentities($sample1).', clean: '.htmlentities(strip_tags($sample1));
// output: dirty: mailto:email@domain.com</p>, clean: mailto:email@domain.com

See it in action here: http://codepad.viper-7.com/KHsIr0

One function call, no crazy regex to maintain.

Here is an example of how to do this with DomDocument:

// create a new DomDocument object
$doc = new DOMDocument();

// load the HTML into the DomDocument object (this would be your source HTML)
libxml_use_internal_errors(true);
$doc->loadHTML('
    <p>
        <br>
        Preston Newbill<br>
        Manager<br>
        pnewbill@domain.com<br>
        <a href="mailto:noob@aol.com">also email me @ noob@aol.com</a><br>
        Party 9/15/2013@10:00pm!
');
libxml_clear_errors();

// grab the body, recursively check for child nodes. Turn any email addresses into links
$body = $doc->getElementsByTagName('body')->item(0);
checkDomNodeForEmailAddress($body);

// strip off the html,head, and body
$doc->removeChild($doc->firstChild);            
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

die('<hr>final product:'.htmlentities($doc->saveHtml()));

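function checkDomNodeForEmailAddress(DOMNode $domNode) {
    foreach ($domNode->childNodes as $node) {
        if ($node->hasChildNodes()) {
            // recurse into child elements, but leave existing links alone
            if (strtolower($node->nodeName) != 'a') {
                checkDomNodeForEmailAddress($node);
            }
        } else {
            // leaf node (usually text): wrap anything email-like in a mailto link.
            // Note: nodeValue stores the replacement as text, so saveHTML() escapes the added tags.
            $node->nodeValue = preg_replace('/(\S+@\S+\.[^\<]+)/', '<a href="mailto:$1">$1</a>', $node->nodeValue);
        }
    }
}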

Try it here: http://codepad.viper-7.com/EpdBKx
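
One thing to watch with the recursive example: a string assigned to nodeValue is stored as text, so the serializer escapes the `<` and `>` of the inserted markup instead of producing an element. A sketch of an alternative that builds real `a` elements follows. The helper name `linkifyEmails`, the wrapper `div`, and the email pattern are all assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: wraps email-like tokens found in text nodes in real <a> elements.
function linkifyEmails(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the input in a div so it can be found again after parsing.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Snapshot of the text nodes that are not already inside a link.
    $textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($textNodes as $text) {
        // With one capture group, odd-indexed parts are the email-like tokens.
        $parts = preg_split('/([\w.+-]+@[\w-]+(?:\.[\w-]+)+)/', $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // nothing email-like in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $link = $doc->createElement('a');
                $link->setAttribute('href', 'mailto:' . $part);
                $link->appendChild($doc->createTextNode($part));
                $fragment->appendChild($link);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize only the children of the wrapper div.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkifyEmails('<p>Manager<br />pnewbill@domain.com</p>');
```

Collecting the text nodes up front avoids modifying the node list while iterating over it, and the XPath filter skips text that already sits inside a link.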

Documentation

strip_tags - http://php.net/manual/en/function.strip-tags.php
DomDocument - http://php.net/manual/en/class.domdocument.php

edited Sep 13 '13 at 17:22

answered Sep 13 '13 at 15:45


That would also remove ALL tags that are put in by the editor, and then it would be one long text string with no formatting. Would it not? – nick Sep 13 '13 at 15:46
@nick I had asked what `ee()->TMPL->tagdata` is. If it is the full template's HTML, that's an even greater reason not to use regex. Regex just isn't appropriate for parsing HTML. Yes, this would strip all tags out of whatever you passed it. If you need to deal with the whole editor's text, then you really should use DomDocument. RegEx seems easier on the surface, but you'll just keep finding little edge cases that break your pattern. – Chris Baker Sep 13 '13 at 15:47
@Chris is right. You should look into a DOM parser. You could then even do something like `document.getElementsByTagName("a");` which would quickly get you every `a` on the page, and you could easily extract the `href` attribute without needing to do any regular expressions or complicated checking. – crush Sep 13 '13 at 15:52
Here's a few trivial examples of realistic text that would get replaced erroneously by the regex in the accepted answer: http://codepad.viper-7.com/wxqOQO – Chris Baker Sep 13 '13 at 15:55
I don't understand why I would need a DOM parser. I'm taking plain text and changing it into a mailto link. I have to maintain the editor's tags or none of this would have been an issue. – nick Sep 13 '13 at 15:55
I've also changed the accepted answer, crush's answer works perfectly. – nick Sep 13 '13 at 15:56
@nick Hmm...I think there is too much we don't know about `ee()->TMPL->tagdata` to explain why a DOM parser might or might not be useful. It might not be useful in this case. I'm not sure what the value of `tagdata` is. – crush Sep 13 '13 at 16:00
@nick Regex is "dumb" (it isn't made to parse HTML), because HTML is not a regular language, which is what regex is good at parsing. With a DOM parser, you are working with what you have, which is markup. Consider something like this: http://codepad.viper-7.com/5qXnjR -- oops! – Chris Baker Sep 13 '13 at 16:01
Agreed, nick, can you show us what `tagData` is, like a vardump? If there's HTML in the string, it is not "plain text" – Chris Baker Sep 13 '13 at 16:01
I'm only catching the plain text that contains an @ and a . to grab all text that may be an email address, and then it converts it into a mailto link without the user having to do anything. So a DOM parser wouldn't work until after the links have been created. Give me a sec, I'll dump the tag and paste it here. But yes, it contains strings. – nick Sep 13 '13 at 16:05
@nick If you wish, please update your question with a `var_dump(ee()->TMPL->tagdata);` so we may offer a more comprehensive, and better, answer! – crush Sep 13 '13 at 16:16
@nick I think you're mixing up some jargon here. Any combination of spaces, characters (with or without letters), and numbers is a string. HTML markup is a string, an email address is a string, `abc123` is a string. "Plain text" refers to string data that is usable as-is -- it should not contain code, markup, or other items that need to be parsed or interpreted. If your string has HTML in it, it is not plain text, it is HTML. Please, do post it :) – Chris Baker Sep 13 '13 at 16:22
I've posted a small section of the dump, it shows you what it's outputting. – nick Sep 13 '13 at 16:27
@nick Here's an example of how you do it with DomDocument: http://codepad.viper-7.com/EpdBKx – Chris Baker Sep 13 '13 at 17:20
