I have Ruby code that extracts email addresses from a page. It outputs the email address, but it also captures other text.

I would like to pull the actual email address out of this string. Sometimes the string will include a mailto: prefix, and sometimes it will not. I was trying to get the single word that occurs before the @, and anything that comes after the @, by using a split, but I'm having trouble. Any ideas? Thanks!

href="mailto:someonesname@domain.rr.com"> |  Email</a></td>
Tiago
Brandon

3 Answers


Use something prebuilt:

require 'uri'

addresses = URI.extract(<<EOT, :mailto)
this is some text. mailto:foo@bar.com and more text
and some more http://foo@bar.com text
href="mailto:someonesname@domain.rr.com"> |  Email</a></td>
EOT
addresses # => ["mailto:foo@bar.com", "mailto:someonesname@domain.rr.com"]

URI comes with Ruby, and the pattern used to parse out URIs is well tested. It's not bulletproof, but it works pretty well. If you're getting false positives, you can filter the returned array with select, reject or grep to drop the unwanted entries.
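As a rough sketch of that filtering step, the helper below takes an array shaped like the one URI.extract returned above, strips the mailto: prefix, and keeps only entries that contain a single @. The name bare_addresses and the "mailto:nobody" entry are made up for illustration, and the final pattern is a loose sanity check rather than real validation.

```ruby
# Hypothetical helper: turn "mailto:" URIs into bare addresses and drop
# entries that do not look like "local@domain".
def bare_addresses(uris)
  uris
    .map { |uri| uri.sub(/\Amailto:/i, '') }  # strip the scheme prefix
    .map { |addr| addr.sub(/\?.*\z/m, '') }   # drop any ?subject=... query part
    .grep(/\A[^@\s]+@[^@\s]+\z/)              # keep entries with exactly one "@"
end

# "mailto:nobody" is an invented false positive to show the filter working.
extracted = ["mailto:foo@bar.com", "mailto:someonesname@domain.rr.com", "mailto:nobody"]
bare_addresses(extracted)
# => ["foo@bar.com", "someonesname@domain.rr.com"]
```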

If you can't count on having mailto:, the problem becomes harder, because email addresses aren't simple to parse; there's too much variation in them. The problem is akin to validating an email address using a pattern, because, again, the format for addresses varies too much. "Using a regular expression to validate an email address" and "JavaScript Email Validation when there are (soon to be) 1000's of TLD's?" are good reads for more information.
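To make the trade-off concrete, here is a minimal sketch of a deliberately loose pattern used with String#scan. The helper name loose_emails is hypothetical, and the pattern only looks for something shaped like local@domain.tld; it will miss some valid addresses and accept some invalid ones.

```ruby
# Hypothetical helper: pull anything shaped like local@domain.tld out of text.
# The pattern is intentionally loose and is not a validator.
def loose_emails(text)
  text.scan(/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/)
end

loose_emails('href="mailto:someonesname@domain.rr.com"> |  Email</a></td>')
# => ["someonesname@domain.rr.com"]

loose_emails('href="someonesname@domain.rr.com"> |  Email</a></td>')
# => ["someonesname@domain.rr.com"]

# Caveat: it also matches inside a URL such as http://foo@bar.com,
# where "foo" is user info rather than a mailbox.
loose_emails('and some more http://foo@bar.com text')
# => ["foo@bar.com"]
```

Whether a false positive like the last one matters depends on the pages being scraped, so a pattern like this is best treated as a first pass.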

the Tin Man

This should also work nicely, though it won't account for invalid email formats; it simply extracts the email address based on your two use cases.

string[/[^\"\:](\w+@.*)(?=\")/]
Ricky

This should work:

inputstring[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")

Explanation:

  • Grab the href attribute and its contents
  • Remove the href= and quotes
  • Remove the mailto: if it's there
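
The same three steps can also be sketched with a capture group instead of slicing by index. This is only a variant under the same assumption (a double-quoted href attribute); the method name email_from_href is made up here. Unlike the one-liner, it returns nil rather than raising when the string has no href.

```ruby
# Hypothetical helper: same steps as above, using a capture group.
def email_from_href(html)
  value = html[/href="([^"]+)"/, 1]    # contents of the href attribute, or nil
  value && value.sub(/\Amailto:/, '')  # drop a leading "mailto:" if present
end

email_from_href('href="mailto:francesco@hawaii.rr.com"> |  Email DuVin</a></td>')
# => "francesco@hawaii.rr.com"

email_from_href('no link here')
# => nil
```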

Example:

irb(main):021:0> test = "href=\"mailto:francesco@hawaii.rr.com\"> |  Email DuVin</a></td>"
=> "href=\"mailto:francesco@hawaii.rr.com\"> |  Email DuVin</a></td>"
irb(main):022:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco@hawaii.rr.com"
irb(main):023:0> test = "href=\"francesco@hawaii.rr.com\"> |  Email DuVin</a></td>"
=> "href=\"francesco@hawaii.rr.com\"> |  Email DuVin</a></td>"
irb(main):024:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco@hawaii.rr.com"
ShaneQful
  • Thanks. Will this work if there is no mailto in the string? – Brandon Sep 02 '14 at 22:45
  • Yes, this will work whether or not there is a mailto. See the example in the answer to see how it works. Also, if you think this is the correct answer, please accept it :) – ShaneQful Sep 02 '14 at 22:56