3

How can I find an email address inside HTML code with Nokogiri? I suppose I will need to use a regex, but I don't know how.

Example code

    <html>
    <title>Example</title>
    <body>
    This is an example text.
    example@example.com
    </body>
    </html>

There is an existing answer covering the case where the address is in an href with a `mailto:` link, but that is not my case. The email addresses are sometimes inside a link, but not always.

Thanks

Mark Thomas
Fred Guth
  • This is definitely *not* a Nokogiri question, it's a text parsing question in ruby. I tagged it with `Ruby` and `regex` to improve your responses. – Mark Thomas Nov 27 '12 at 23:03

2 Answers

6

If you're just trying to pull the email address out of a string that happens to be HTML, you don't need Nokogiri.

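    html_string   = "Your HTML here..."
    # String#[] returns the first match, or nil if the string contains no address
    # (calling .match(...)[0] would raise NoMethodError when nothing matches)
    email_address = html_string[/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i]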

This isn't a perfect solution though, as the RFC for what constitutes a 'valid' email address is very lenient. This means most regular expressions you come across (the above one included) do not account for edge case valid addresses. For example, according to the RFC

$A12345@example.com

is a valid email address, but the above regular expression does not match it in full: because `$` is not in the character class, the match starts at the `A` and returns `A12345@example.com`.
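To make that limitation concrete, here is a minimal sketch of how the pattern above behaves on the edge-case address and on a string with no address at all. The constant name `STRICT_RE` and the variable names are only illustrative; they are not part of the original answer.

```ruby
# The same pattern as above, stored in a constant (the name is hypothetical).
STRICT_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# String#[] with a regex returns the matched text, or nil when nothing matches.
partial = "$A12345@example.com"[STRICT_RE]  # "$" is outside the character class
missing = "no address here"[STRICT_RE]      # no "@" anywhere, so no match

puts partial.inspect
puts missing.inspect
```

The pattern is unanchored, so it skips characters it cannot match and picks up the address from the first character it does allow.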

deefour
  • Why this really is not a perfect solution is that it finds only the first email address on the page – erdomester Jun 07 '15 at 18:31
  • This **is** a perfect solution *for the question asked here*. The question had nothing to do with parsing *multiple* addresses. – deefour Jun 08 '15 at 13:09
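If you do need every address on the page rather than just the first, one possible approach is `String#scan`, which returns all non-overlapping matches. This is only a sketch: `extract_emails`, `EMAIL_RE`, and the sample HTML are hypothetical names and data, and the pattern is the same one used in the answer above, with the same limitations.

```ruby
# Same pattern as in the answer above; the constant name is illustrative.
EMAIL_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Returns every distinct address found in the string, in order of appearance.
# With no capture groups in the pattern, scan yields an array of matched strings.
def extract_emails(html)
  html.scan(EMAIL_RE).uniq
end

# Hypothetical sample: one bare address, one that appears in both href and link text.
sample = '<body>Write to a@example.com or <a href="mailto:b@example.org">b@example.org</a>.</body>'
emails = extract_emails(sample)
puts emails.inspect
```

`uniq` is there because an address inside a `mailto:` link typically appears twice, once in the `href` and once in the link text.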
1

Just use a regex on the HTML string; there's no need for Nokogiri (as @deefour suggested). For the regex itself, I'd suggest the one (called `AUTO_EMAIL_RE`) used by the rails_autolink gem:

    /[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

This should catch those edge cases that stricter regex filters miss:

    RE = /[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

    RE.match('abc@example.com')
    #=> #<MatchData "abc@example.com">

    RE.match('$A12345@example.com')
    #=> #<MatchData "$A12345@example.com">

Note that if you really want to match all valid email addresses, you're going to need a mighty big regex.

Community
Chris Salzberg