0

I want to extract an email address embedded in tags, e.g. <email> test@demo.com </email>, where the source is &lt;email&gt;test@demo.com&lt;/email&gt;

The expression I use is as follows: (?<=email&gt;).*(?=&lt;)/i). This works well. However, if the email is a hyperlink, i.e. &lt;email&gt;**<a href="mailto:test@demo.com" target="_blank"**>test@demo.com</a> &lt;/email&gt;, then I can no longer extract the exact email address. I get <a href="mailto:test@demo.com">test@demo.com</a> instead of test@demo.com. I have tried (?<=a href="mailto:).*(?="target="_blank")/i), but nothing is returned. Any ideas on how to extract the email when the hyperlink is there?
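
For reference, a minimal sketch of the behaviour described above. The string `src` is an assumed stand-in for the page source, and the third pattern is only one possible variation, not a general HTML parser. The second pattern appears to return nothing because its lookahead expects `"target` with no space, while the markup has `" target`.

```javascript
// Assumed stand-in for the page source described in the question.
const src = '&lt;email&gt;<a href="mailto:test@demo.com" target="_blank">test@demo.com</a> &lt;/email&gt;';

// First pattern: the greedy .* runs up to the last "&lt;", so the whole <a> element is captured.
const whole = src.match(/(?<=email&gt;).*(?=&lt;)/i);

// Second pattern: the lookahead requires `"target` with no space in between, so there is no match.
const none = src.match(/(?<=a href="mailto:).*(?="target="_blank")/i);

// One possible variation: stop at the closing quote of the href value.
const email = src.match(/(?<=href="mailto:)[^"]+/i);

console.log(whole && whole[0]);
console.log(none);
console.log(email && email[0]);
```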

Cœur
Paul
  • 5
    If it's inside a tag, why don't you parse it as DOM content and just get the content of that tag? Regex seems to be ill suited here. – VLAZ Nov 09 '18 at 12:07
  • Possible duplicate of [How to validate an email address in JavaScript?](https://stackoverflow.com/questions/46155/how-to-validate-an-email-address-in-javascript) – Chango Nov 09 '18 at 12:21
  • Do as @vlaz says and parse it as DOM (you can use plain JavaScript or jQuery, if you wish); then you can check the answer I posted before. – Chango Nov 09 '18 at 12:23

2 Answers

1

Web dev 101: don't parse HTML with regex, use DOM manipulations instead.

The snippet below logs all the emails, whether they sit directly inside <email> tags, inside an <a> within <email> tags, or inside any nesting of tags.

// Collect every <email> element, read its text (nested tags are ignored), and trim whitespace.
console.log(
  Array.from(document.getElementsByTagName('email'))
    .map(elt => elt.textContent)
    .map(email => email.trim())
);
<email>john@doe.com</email>
<email><a href="mailto:john@doe.com">john@doe.com</a></email>
<email><b><a href="mailto:john@doe.com">john@doe.com</a></b></email>
<email><span><b><a href="mailto:john@doe.com">john@doe.com</a></b></span></email>
<email>"o'brian"@irish.com</email>

The .trim() call is useful because whitespace in the HTML can end up around the email.
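
If the markup is held in a string rather than in the live page, the same idea can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: `extractEmails` is a hypothetical helper name, the sample markup is made up, and the `DOMParser` branch only runs where that browser API is available.

```javascript
// Hypothetical helper: collect the trimmed text of every <email> element under `root`.
// `root` can be `document` or any object that exposes getElementsByTagName.
function extractEmails(root, tagName = 'email') {
  return Array.from(root.getElementsByTagName(tagName), (elt) => elt.textContent.trim());
}

// Browser-only usage: parse a markup string instead of reading the live page.
if (typeof DOMParser !== 'undefined') {
  // Made-up sample markup for illustration.
  const markup = '<email><a href="mailto:test@demo.com" target="_blank">test@demo.com</a> </email>';
  const doc = new DOMParser().parseFromString(markup, 'text/html');
  console.log(extractEmails(doc));
}
```

Passing the root in as a parameter keeps the function usable on the page itself via `extractEmails(document)`.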

Nino Filiu
  • 1
    It's even simpler to do `elt.textContent` and you wouldn't have to worry about whether it's wrapped in another tag and how many there are. Because currently this will not work for, say `email@email.com` – VLAZ Nov 09 '18 at 12:58
  • Didn't know about this! You can edit my answer as such if you wish. – Nino Filiu Nov 09 '18 at 13:06
  • 1
    I have done this. I also expanded the list of example emails to better showcase it. I threw in an uncommon email to demonstrate how it very easily handles them, unlike using a regex to parse out the email (the other answer to this question fails at this). – VLAZ Nov 09 '18 at 13:37
0

You can walk each element of the DOM and match an email regex against its text content, as in the snippet below:

<script>
// Return every substring of `text` that looks like an email address, or null if there is none.
function getEmailsFromText(text) {
    return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi);
}

// Check the text content of every element on the page.
var items = document.getElementsByTagName("*");
for (var i = 0; i < items.length; i++) {
    var text = items.item(i).textContent;
    var emailIds = getEmailsFromText(text);
    if (emailIds) {
        console.log("Email IDs: " + emailIds);
    }
}
</script>

To test, open your browser's JavaScript console, paste the code from inside the script tag, and you will see all the email IDs on the current HTML page.
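
Because getElementsByTagName("*") also returns every ancestor element, and an ancestor's textContent includes the text of its descendants, the loop above will usually report the same address several times. Below is a minimal sketch of one way to avoid that, assuming it is acceptable to scan a single block of text and deduplicate the matches. `uniqueEmailsFromText` is a hypothetical name, and the pattern is the same simplified one as in the snippet above.

```javascript
// Hypothetical helper: return each email-like substring of `text` once, in order of first appearance.
function uniqueEmailsFromText(text) {
  // Same simplified pattern as in the snippet above; it is not a full email validator.
  const pattern = /[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/g;
  // match() returns null when nothing matches, so fall back to an empty array.
  return [...new Set(text.match(pattern) || [])];
}

// In a browser, scan the page text once instead of once per element.
if (typeof document !== 'undefined' && document.body) {
  console.log(uniqueEmailsFromText(document.body.textContent));
}
```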

Abhishek
  • 1
    This regex fails at recognising legitimate emails like `my+email@gmail.com`. That's just a mild example, though; it will also fail for `email@яндекс.ру` (non-Latin characters) or `email@companydomain` (emails that don't have a TLD part). People who have apostrophes in their name, like `O'Brien`, are also not going to be recognised. Aside from barring legitimate emails, it allows invalid ones, such as ones starting with a dot. [What characters are allowed in an email address](https://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-an-email-address) – VLAZ Nov 09 '18 at 13:44
  • Thanks @vlaz for sharing this. I hadn't thought of such valid email IDs. If you can share an email regex that supports all of these, that would be great, and I can update my answer as well. – Abhishek Nov 11 '18 at 20:55
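
The cases VLAZ lists can be checked directly against the pattern from the answer. This is only a sketch: the last two sample addresses are made up for illustration and do not come from the thread.

```javascript
// Pattern copied from the answer above.
const pattern = /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi;

// Sample inputs; the last two addresses are made up for illustration.
const samples = [
  'my+email@gmail.com',   // "+" is not in the character class
  'email@яндекс.ру',      // non-Latin domain
  'email@companydomain',  // no dot in the domain part
  "o'brien@example.com",  // apostrophe in the local part
  '.dot@example.com',     // leading dot
];

for (const sample of samples) {
  // match() with the g flag returns all matches, or null when there are none.
  console.log(sample, '->', sample.match(pattern));
}
```

With this pattern, `my+email@gmail.com` and `o'brien@example.com` are only partially matched (the part up to and including the `+` or `'` is dropped), the second and third samples do not match at all, and `.dot@example.com` is accepted as-is.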