I've written a very basic regex in Ruby for scraping email addresses off the web. It looks like the following:

/\b\w+(\.\w+)*@\w+\.\w+(\.\w+)*\b/

When I load this into irb or rubular, I create the following string:

"example@live.com"

When I run the Regexp.match(string) command in irb, I get this:

regexp.match(string) => #<MatchData "example@live.com" 1:nil 2:nil>

So the match seems to be recorded in the MatchData object. However, when I run the String.scan(regex) command (which is what I'm primarily interested in), I get the following:

string.scan(regex) => [[nil, nil]]

Why isn't scan returning the matched email address? Is it a problem with the regular expression? Or is it a nuance of String.scan/Regexp/MatchData that somebody could make me aware of?

Richard Stokes

1 Answer

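The main issue is that your capturing groups (the parts of the pattern in parentheses) aren't capturing what you want. When a pattern contains groups, String#scan returns an array of the group captures for each match rather than the whole matched text.

Let's say you want just the username and domain. You should use something along the lines of /\b(\w+(?:\.\w+)*)@(\w+(?:\.\w+)*)\.\w+\b/. As it stands, your pattern matches the input text, but both groups are optional (they are quantified with *) and match zero times for "example@live.com", so each one captures nil.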

Also, why not just use /([\w\.]+)@([\w\.]+)\.\w+/? (I'm not too familiar with Ruby's regex engine, but that should be about right. You don't even need to check for word boundaries if you're using greedy quantifiers.)
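To make the difference concrete, here is a minimal sketch that runs String#scan with the question's pattern and with the grouped pattern from this answer against the question's sample string. The variable names (original, grouped, and so on) are only illustrative.

```ruby
string = "example@live.com"

# Pattern from the question: both groups are optional (`*`) and
# match zero times for this input, so each capture is nil.
original = /\b\w+(\.\w+)*@\w+\.\w+(\.\w+)*\b/

# Pattern from this answer: two capturing groups (username, domain),
# with the repeated parts wrapped in non-capturing (?:...) groups.
grouped = /\b(\w+(?:\.\w+)*)@(\w+(?:\.\w+)*)\.\w+\b/

# With groups present, scan returns one array of captures per match.
original_result = string.scan(original)
grouped_result  = string.scan(grouped)

p original_result  # [[nil, nil]]
p grouped_result   # [["example", "live"]]
```

Note that the second group stops at "live" because the trailing \.\w+ still has to match ".com".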

Steve Wang
  • What does ?: do? There doesn't seem to be any mention of it in the Pickaxe section on regular expressions. And I was using the very general regex you posted there to begin with, but it was returning a lot of false positives (e.g. character strings from image filenames etc). Looking for a more robust regex to eliminate the false positives. – Richard Stokes Jul 20 '11 at 19:13
  • ?: just creates a noncapturing group; I've used it here to keep the capturing groups somewhat sane, since the behavior of nested capturing groups is somewhat tricky (what if it doesn't exist, for instance?). – Steve Wang Jul 20 '11 at 22:57
  • Also, you have image filenames containing `@`? I suppose for a more robust regex, you'd check for all possible top-level domains (although that might be useless soon) via `(?:com|net|org|...)`. – Steve Wang Jul 20 '11 at 22:58
  • I think I understand capturing/non-capturing groups now. I don't want to capture ANY specific parts of the text, just match any email addresses. I removed the capturing groups from the first regex in your response above and made it more specific/robust than just \w characters and it seems to be working a bit better now, thanks – Richard Stokes Aug 10 '11 at 12:44
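For reference, the approach described in the last comment can be sketched as follows. This is an assumption about its general shape, not the asker's final pattern (which isn't shown): it is simply the question's original regex with every group made non-capturing, and the sample text is made up.

```ruby
# The question's pattern with (?:...) in place of (...), so there are
# no capturing groups and scan returns the full matched strings.
email = /\b\w+(?:\.\w+)*@\w+\.\w+(?:\.\w+)*\b/

# Hypothetical sample text for illustration.
text = "Write to example@live.com or first.last@mail.example.org today."

addresses = text.scan(email)

p addresses  # ["example@live.com", "first.last@mail.example.org"]
```

Because the pattern still relies on \w, it will also accept strings such as image filenames containing @, so tightening the character classes as the comment describes remains worthwhile.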