Find email addresses in large data stream

Question

STILL NOT RESOLVED :( [Feb 11th]

I have a large text file full of random data and want to pull out all the email addresses from it.

I would like to do this in Ruby, with pseudo code like this:

monster_data_string = "asfsfsdfsdfsf  sfda **joe@example.com** sdfdsf"
monster_data_string.match(EMAIL_REGEX)

Does anyone know what Ruby email regular expression I would use to accomplish this?

Please keep in mind that I'm looking for a Ruby answer to this. I have already tried numerous regexes found by googling, but most of them cause Ruby runtime errors stating that characters like "+" and "*" are invalid/unrecognized.

What I have already tried is:

monster_data_string.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)

but I receive Ruby errors stating that "+" is an invalid character.

Thanks in advance

please provide a constructive suggestion. If not regex, then what? — Thufir, Jan 22 '12 at 22:11
Blimey, two in 10 minutes... see http://stackoverflow.com/questions/535600 — womble, Feb 11 '09 at 06:39

score 14 · Answer 1 · answered Sep 21 '09 at 06:34

Watch this...

require 'yaml'

# File.read opens the file, reads it all, and closes it again.
content = File.read("content.txt")
r = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/
emails = content.scan(r).uniq
puts YAML.dump(emails)
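
Since the question mentions a large file, here is a minimal sketch of the same idea that reads line by line instead of loading everything into memory. The helper name `extract_emails` and the constant `STREAM_EMAIL` are made up for illustration, and the TLD part is loosened to `{2,}`, which is my own assumption rather than part of the answer.

```ruby
require 'set'

# Assumption: same pattern as above, except {2,} instead of {2,4}.
STREAM_EMAIL = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b/

# Hypothetical helper: collects unique addresses from anything that
# responds to each_line (a File, a StringIO, or a plain String).
def extract_emails(io)
  seen = Set.new
  io.each_line do |line|
    # With no capture groups, scan yields each full match to the block.
    line.scan(STREAM_EMAIL) { |address| seen << address }
  end
  seen.to_a
end

sample = "sfda **joe@example.com** sdfdsf\nbob@example.org joe@example.com\n"
p extract_emails(sample)
```

For a real file, something like `File.open("content.txt") { |f| puts extract_emails(f) }` should work the same way, with only one line held in memory at a time.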

2015 here. Doing a check on TLD length is sooo outdated – b1nary Aug 14 '15 at 06:48

score 3 · Answer 2 · answered Feb 12 '09 at 10:31

If you're getting an error message about + or * being invalid in regexes, you're doing something very wrong. This is a valid regex in Ruby, although it's not the one you want:

/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i

For one thing, you don't want to anchor the regex to the start and end of lines (^ and $) if you're trying to pluck the addresses from "random" text. But once you've gotten rid of the anchors, your regex will match **joe@example.com in your test string, which I presume you don't want. This regex from Regular-Expressions.info does a better job, but read that page for tips on tweaking it to meet your particular needs.

/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

Finally (and you may already know this), you won't want to use the match() method because that will only find the first match. Try scan() instead.
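
To make the match() versus scan() difference concrete, here is a small sketch using the regex above. The method names `first_email` and `all_emails` and the second address in the sample string are only for illustration.

```ruby
# The Regular-Expressions.info pattern quoted above.
ADDRESS = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# match stops at the first hit and returns a MatchData object (or nil).
def first_email(text)
  found = text.match(ADDRESS)
  found && found[0]
end

# scan walks the whole string and returns an array of every hit.
def all_emails(text)
  text.scan(ADDRESS)
end

text = "asfsfsdfsdfsf  sfda **joe@example.com** sdfdsf ann@example.org"
p first_email(text)  # => "joe@example.com"
p all_emails(text)   # => ["joe@example.com", "ann@example.org"]
```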

Noel Walters · Answer 3 · 2009-02-11T13:59:35.463

Given that it is not possible to parse every valid email address using a regexp, you are left with two choices:

Make a regexp that matches as many valid email addresses as possible and live with the fact that some valid but rarely used forms of email address might get overlooked.

or

Make a regexp that matches anything that "might be" an email address and then live with the false positives.

I use the second approach to weed out obviously wrong email addresses when validating user sign-up email addresses on a web page.

Gleaned from Ruby Cookbook which has a very good section on email address validation:

valid = '[^ @]+'
/^#{valid}@#{valid}\.#{valid}/

Apparently there is a 6343 character Perl regexp written by Paul Warren that does a very good job and also works in Ruby, but even that is not foolproof (I think it might also have some performance implications).
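
As a rough sketch of the second ("might be") approach, here is the loose pattern used both for validation and for scanning free text. I have swapped the space in the character class for `\s` and added `\A`/`\z` anchors in the validator; those changes and the names `LOOSE_EMAIL`, `LOOSE_VALIDATOR` and `might_be_email?` are my assumptions, not something taken from the Ruby Cookbook.

```ruby
# Single quotes keep the backslash, so the regex sees [^\s@]+.
LOOSE_PART = '[^\s@]+'

# Unanchored: for scanning free text.
LOOSE_EMAIL = /#{LOOSE_PART}@#{LOOSE_PART}\.#{LOOSE_PART}/

# Anchored to the whole string: for checking a single form field.
LOOSE_VALIDATOR = /\A#{LOOSE_PART}@#{LOOSE_PART}\.#{LOOSE_PART}\z/

# Hypothetical helper: true if the string has the rough shape x@y.z.
def might_be_email?(candidate)
  !!(candidate =~ LOOSE_VALIDATOR)
end

p might_be_email?("joe@example.com")  # => true
p might_be_email?("joe@example")      # => false

# The price of being this permissive: surrounding junk is matched too.
p "sfda **joe@example.com** sdfdsf".scan(LOOSE_EMAIL)
# => ["**joe@example.com**"]
```

The last line shows the kind of false positive you have to live with, or clean up afterwards, when the pattern is this forgiving.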

score 1 · Answer 4 · answered Feb 12 '09 at 00:10

What kind of runtime error messages are you getting? Is it regarding the regexps as invalid, or is it breaking due to the target string being too large?

Andrew Grimm

It's related to the regexp being invalid. The errors state that the "+" or "*" characters are invalid/unrecognized. – Feb 12 '09 at 01:52
I've tried using the \ character to escape them, but it's still not working. – Feb 12 '09 at 02:52
Specifically, I have tried the following code: string_of_data.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i) where string_of_data is the string variable read in that contains the randomly mixed data of words and email addresses. – Feb 12 '09 at 02:53
You probably don't want to hear "Works for me", right? Can you try generating the simplest combination of string_of_data and regular expression that doesn't work, and the most complex combination that does work, and pasting all that on a gist or a pastie? – Andrew Grimm Feb 12 '09 at 06:16
I tried using monster_data_string = "aa **joe@example.com** sf" and regexp = /([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i (I removed the ^ and $) in "try ruby! (in your browser)", and that worked. – Andrew Grimm Feb 12 '09 at 08:24

score 1 · Answer 5 · answered Feb 12 '09 at 08:53

To try and help you get there (though not very elegantly, I admit):

I think the start and end anchors (^ and $) aren't helping. You may also want to filter out the asterisks:

irb(main):001:0> mds = "asfsfsdfsdfsf  sfda **joe@example.com** sdfdsf"
  => "asfsfsdfsdfsf  sfda **joe@example.com** sdfdsf"
irb(main):003:0> mds.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
  => nil
irb(main):004:0> mds.match(/([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
  => #<MatchData "**joe@example.com" 1:"**joe" 2:"example.com">
irb(main):005:0> mds.match(/([^@\s*]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
  => #<MatchData "joe@example.com" 1:"joe" 2:"example.com">

score 0 · Answer 6 · answered Sep 25 '10 at 04:02

Even better,

require 'yaml'

content = "asfsfsdfsdfsf  sfda **joe@example.com.au** sdfdsf cool_me@example.com.fr"

r = Regexp.new(/\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/)     
emails = content.scan(r).uniq                                    
puts YAML.dump(emails)

will give you

    ---
    - - joe
      - example
      - .com.au
    - - cool_me
      - example
      - .com.fr
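
The output is nested because scan returns the capture groups, not the whole match, whenever the pattern contains groups. If you want plain address strings back, one option is to stitch the groups together again. This is only a sketch, and the names `GROUPED_EMAIL` and `whole_addresses` are made up for the example.

```ruby
# Same grouped pattern as in the answer above.
GROUPED_EMAIL = /\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/

# Hypothetical helper: rebuilds each address from its three captured parts.
def whole_addresses(text)
  text.scan(GROUPED_EMAIL)
      .map { |local, host, suffix| "#{local}@#{host}#{suffix}" }
      .uniq
end

content = "asfsfsdfsdfsf  sfda **joe@example.com.au** sdfdsf cool_me@example.com.fr"
p whole_addresses(content)
# => ["joe@example.com.au", "cool_me@example.com.fr"]
```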
