7

I am trying to use the stringr library to extract emails from a big, messy file.

str_match doesn't allow perl=TRUE, and I can't figure out the escape characters to get it to work.

Can someone recommend a relatively robust regex that would work in the context below?

library(stringr)
emails <- c("larry@gmail.com", "larry-sally@sally.com", "larry@sally.larry.com")
regex <- "SomeRegex"
str_match(emails, regex)
amon
toomey8
  • Um, what's your best guess for SomeRegex? Also, I think your example should include cases that you don't want matched. I could match all of those with `.*`, right? – Frank Oct 13 '13 at 03:42
  • If I use `grep("@", emails)`, it matches correctly. – RJ- Oct 13 '13 at 03:58
  • And also, `str_match` extracts the first matched group. Is that what you want or do you want to extract all matched groups? – RJ- Oct 13 '13 at 03:59
  • In `R`, grep usually matches a vector of multiple strings against one regexp – hwnd Oct 13 '13 at 04:05
  • @hwnd I had the impression that was what the OP wanted. – RJ- Oct 13 '13 at 04:55

3 Answers

10
> "^[[:alnum:].-_]+@[[:alnum:].-]+$"->regex
> str_match(emails, regex)
     [,1]                   
[1,] "larry@gmail.com"      
[2,] "larry-sally@sally.com"
[3,] "larry@sally.larry.com"

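The @ sign does not need escaping in a regex, and "." is not special inside a character class. A "-" is literal only as the first or last character of a class; between two characters it denotes a range, so ".-_" in the pattern above spans every character from "." to "_". If you want to add a requirement for ".com", ".co", ".edu", ".org", then you should specify how complete that list needs to be.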

As pointed out by M42, this is not a surefire method. In fact it is claimed that there is no sure-fire method: Using a regular expression to validate an email address
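Because of the `^` and `$` anchors, the pattern above only matches when the whole string is an address. For pulling addresses out of longer, messier text, a rough sketch in base R might look like the following. The input strings are made up for illustration, and the pattern is a variant with the anchors dropped and the dash placed last in the class; it is a starting point under those assumptions, not a validator.

```r
# Hypothetical messy input; the addresses are invented for illustration
messy <- c("Contact larry@gmail.com or larry-sally@sally.com today",
           "no address here",
           "cc: <larry@sally.larry.com>;")

# No ^ or $ anchors, so the pattern can match inside a longer string.
# The dash is last in each class, so it is a literal dash, not a range.
email_regex <- "[[:alnum:]._-]+@[[:alnum:].-]+"

# gregexpr() locates every match in each element;
# regmatches() extracts the matched text as a list, one entry per element
found <- regmatches(messy, gregexpr(email_regex, messy, perl = TRUE))

# Flatten to a single character vector; elements without a match contribute nothing
unlist(found)
```

With stringr, `str_extract_all(messy, email_regex)` should give the same list. Note that an address followed directly by a sentence-ending period would have that period included in the match, since "." is allowed in the domain part.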

IRTFM
  • It will match `...@---` – Toto Oct 13 '13 at 08:52
  • Yes, it will. My understanding of the question was that the questioner needed a start that included a discussion of regex metacharacters. toomey8 did not offer a test case that had items that needed rejection. – IRTFM Oct 13 '13 at 15:55
  • this answer worked for me, but for posterity it is worth mentioning that I've moved to Python because the broader support and general libraries made lots of tasks (parsing xml, connecting to Google analytics, connecting to a google spreadsheet, getting the tld out of a URL) much easier, and with the advent of Pandas working on Python seemed more effective. – toomey8 Feb 08 '14 at 13:17
  • ... this doesn't work for lots of cases, including, e.g., things with 2 asterisks... – Sheridan Grant May 29 '20 at 05:54
  • Small improvement: use `^[[:alnum:].-_\\+]+@[[:alnum:].-]+$` to include a "+" sign in the part before the "@" (which is a valid address and can be used in gmail and G suite to create aliases). – Adi Sarid Nov 14 '22 at 11:54
4

I found this regex worked better for me:

^[[:alnum:]._-]+@[[:alnum:].-]+$

Dash does have a special meaning in a character class unless it is the first or last character. Elsewhere it is a range operator, as in "A-Z".
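A quick way to see the difference is to run both patterns against strings that should be rejected. This is a small sketch using base R's `grepl` with `perl = TRUE` (chosen here so ranges are compared by code point); the test strings are made up for illustration.

```r
# Dash between "." and "_": a range covering every character from "." to "_",
# which includes "@", "<", "/", ":" and others
range_regex <- "^[[:alnum:].-_]+@[[:alnum:].-]+$"

# Dash moved to the end of the class: a literal dash
literal_regex <- "^[[:alnum:]._-]+@[[:alnum:].-]+$"

# One plausible address and two strings that should not pass
tests <- c("larry@gmail.com", "a@b@c.com", "la<rry@gmail.com")

grepl(range_regex, tests, perl = TRUE)
# [1] TRUE TRUE TRUE

grepl(literal_regex, tests, perl = TRUE)
# [1]  TRUE FALSE FALSE
```

The range version accepts the second and third strings because "@" and "<" both fall between "." and "_" in code point order.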

Ken Taylor
0

Actually, I'd recommend a longer regex, since the solutions above allow for an email like test@test.com. with a trailing dot.

isMail <- function(x) {
  # TRUE for each element of x that matches the pattern, FALSE otherwise
  grepl("^[[:alnum:]._-]+@[[:alnum:].-]+$", x)
}
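The pattern inside `isMail` still accepts a trailing dot, because "." is allowed anywhere in the domain part. One possible way to tighten it, sketched here as an assumption rather than as the pattern this answer had in mind, is to require the domain to be dot-separated labels. The name `is_mail_strict` is hypothetical.

```r
# Hypothetical stricter variant: the domain must be one label followed by
# at least one ".label", so it cannot end in a dot or consist only of dashes and dots
is_mail_strict <- function(x) {
  grepl("^[[:alnum:]._-]+@[[:alnum:]-]+(\\.[[:alnum:]-]+)+$", x)
}

is_mail_strict(c("test@test.com.", "larry@sally.larry.com", "...@---"))
# [1] FALSE  TRUE FALSE
```

This is still far from full address validation; it only closes the trailing-dot case and requires at least one dot in the domain.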
z-cool