
I want to match the email addresses in the following text:

uma@cs.stanford.edu - match
uma at cs.Stanford.edu - match
http://infolab.stanford.edu/~widom/yearoff.h
we
genale.stanford.edu
n <A href="mailto:cheriton@cs.stanford.edu - match
hola   @  kirti.edu - match

Now I want to capture only two parts of each email address, for example (uma) and (cs.stanford) from uma@cs.stanford.edu.

My current pattern is:

(\w+)[(\s+at\s+)|(\s*@\s*)]+(\w+|\w+\.\w+).edu

But it matches the string infolab.stanford.edu, which I don't want. Can anybody suggest a modification?

Surjya Narayana Padhi
  • What do you want matched out of the `mailto:` line? Which dialect of regex are you using — what's the host language? The answers will differ between JavaScript, Python, C++, Ruby, C, Perl, Java, various dialects of SQL and PHP, to name but a few of the many possibilities. And for C, there are multiple possible regex packages, such as PCRE, or POSIX, or HS, or ... – Jonathan Leffler Oct 25 '15 at 05:35
  • Note that the square brackets form a funny character class in your regex. You use round brackets (parentheses) to enclose alternatives, not square brackets. – Jonathan Leffler Oct 25 '15 at 05:36
  • @JonathanLeffler: I used parentheses, but they capture the 'at' or @, which I don't need. Is there any way to group without capturing? – Surjya Narayana Padhi Oct 25 '15 at 05:51
  • Since you've not identified the dialect of regex you're using, I don't know. It matters; PCRE (Perl compatible regular expressions) have ways of suppressing captures, but many other regex packages don't. I'm far from convinced you need the parentheses around `(\s+at\s+)` or `(\s*@\s*)`, so that capturing should be immaterial. Note that the real regex for matching email addresses is about a mile long. See [Using a regular expression to validate an email address](https://stackoverflow.com/questions/201323)! Note the third answer. – Jonathan Leffler Oct 25 '15 at 05:53
  • Your character class is wrong, or probably shouldn't be there at all. `[(\s+at\s+)|(\s*@\s*)]` is equivalent to `[+*|()at@\s]`, i.e. it matches any one of the characters between the square brackets. – Bohemian May 02 '23 at 10:09
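
The character-class point in the comments above can be checked directly. This is a minimal sketch that assumes Python's `re` module (the question never names a dialect); the names `ORIGINAL`, `CHAR_CLASS` and `single_char_hits` are only illustrative. It shows that the bracketed part accepts a single `a`, which is how part of the URL gets matched.

```python
import re

# The asker's pattern, copied verbatim. Using Python's re module is an assumption.
ORIGINAL = r"(\w+)[(\s+at\s+)|(\s*@\s*)]+(\w+|\w+\.\w+).edu"

# On its own, the bracketed part is a character class: it matches exactly one
# character out of ( ) + * | a t @ or whitespace.
CHAR_CLASS = r"[(\s+at\s+)|(\s*@\s*)]"

single_char_hits = [c for c in "at@(|x" if re.fullmatch(CHAR_CLASS, c)]
print(single_char_hits)  # every candidate except "x" is accepted

# The "a" in "infolab" satisfies the class, so the URL produces a match.
m = re.search(ORIGINAL, "http://infolab.stanford.edu/~widom/yearoff.h")
print(m.group(0))   # infolab.stanford.edu
print(m.groups())   # ('infol', 'b.stanford')
```

The captured groups make the problem visible: the "username" is `infol` and the "separator" is the letter `a`.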

3 Answers


As long as you understand that this regex doesn't verify the correctness of your email address, but merely acts as a quick first line of defense against malformed addresses, an easy fix to your regex is as follows:

([\w.]+)(?:\s+at\s+|\s*@\s*)(\w+|\w+\.\w+)\.edu

In particular, your regex missed addresses whose usernames contain a `.` (which my main email address uses, for example), and its middle part was broken: the square brackets turn it into a character class, and the trailing `+` then lets that class repeat.
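
To see how this fix behaves on the sample lines from the question, here is a small sketch. It assumes Python's `re` module, since no dialect was given, and it escapes the dot before `edu` so that only a literal period matches there. The helper `extract` and the `samples` list are hypothetical names for illustration.

```python
import re

# Pattern from this answer. Python's re module is an assumption, and the dot
# before "edu" is escaped so it only matches a literal period.
EMAIL = re.compile(r"([\w.]+)(?:\s+at\s+|\s*@\s*)(\w+|\w+\.\w+)\.edu")

samples = [
    "uma@cs.stanford.edu",
    "uma at cs.Stanford.edu",
    "http://infolab.stanford.edu/~widom/yearoff.h",
    'n <A href="mailto:cheriton@cs.stanford.edu',
    "hola   @  kirti.edu",
]

def extract(line):
    """Return (user, domain without .edu), or None when nothing matches."""
    m = EMAIL.search(line)
    return m.groups() if m else None

results = [extract(s) for s in samples]
for s, r in zip(samples, results):
    print(f"{s!r} -> {r}")
```

Because the separator sits in a non-capturing `(?:...)` group, only the username and the domain show up in the result, and the URL line yields `None`.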

greg-449
Blindy

This works for your sample input:

(\w+) *(?:@|\bat\b) *(\S+)

Regex breakdown:

  • `(\w+)` one or more word chars
  • ` *` zero or more spaces (could use `\s*`, but no need)
  • `(?:...)` non-capturing group (to leave username in group 1 and domain in group 2)
  • `@|\bat\b` either `@` or `at` as a solitary word (`\b` means word boundary), so it doesn't match the `at` in `match`
  • `(\S+)` one or more non-whitespace chars

This assumes usernames consist only of word chars (letters, digits and the underscore). To work more generally, add dots and dashes:

([\w.-]+) *(?:@|\bat\b) *(\S+)
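
A quick way to try the more general pattern, sketched under the assumption that the host language is Python (the question does not say). The helper name `split_address` is hypothetical.

```python
import re

# The more general pattern from this answer. Python's re module is an assumption.
PATTERN = re.compile(r"([\w.-]+) *(?:@|\bat\b) *(\S+)")

def split_address(line):
    """Return (username, domain) for the first address-like match, else None."""
    m = PATTERN.search(line)
    return m.groups() if m else None

print(split_address("uma at cs.Stanford.edu - match"))  # ('uma', 'cs.Stanford.edu')
print(split_address("first.last@cs.stanford.edu"))      # ('first.last', 'cs.stanford.edu')
print(split_address("genale.stanford.edu"))             # None
print(split_address("match"))                           # None: \b blocks the "at" inside "match"
```

Note that group 2 keeps the `.edu` suffix here, so getting just `cs.stanford` would need an extra step afterwards.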
Bohemian

The main issue with your current regex is that you try to use [ ] as a group. Square brackets indicate a character class, not a group. If you swap these out for ( ) you'll notice that your regex matches the desired result.

This results in the regex:

(\w+)((\s+at\s+)|(\s*@\s*))+(\w+|\w+\.\w+).edu

Optionally you could choose to remove some unnecessary groups:

(\w+)(\s+at\s+|\s*@\s*)+(\w+|\w+\.\w+).edu
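
One thing to keep in mind with both versions is that the extra parentheses shift the group numbers, so the domain is no longer in group 2. The sketch below assumes Python's `re` module (the dialect is not stated in the question). The third pattern, `NON_CAPTURING`, is not from this answer: it is a hypothetical variant using `(?:...)` for dialects that support it.

```python
import re

# The two patterns from this answer, copied verbatim. Python's re is an assumption.
WITH_NESTED = re.compile(r"(\w+)((\s+at\s+)|(\s*@\s*))+(\w+|\w+\.\w+).edu")
SIMPLIFIED = re.compile(r"(\w+)(\s+at\s+|\s*@\s*)+(\w+|\w+\.\w+).edu")

# Hypothetical variant (not in the answer): (?:...) groups without capturing.
NON_CAPTURING = re.compile(r"(\w+)(?:\s+at\s+|\s*@\s*)+(\w+|\w+\.\w+).edu")

text = "uma at cs.Stanford.edu"

nested_groups = WITH_NESTED.search(text).groups()
simplified_groups = SIMPLIFIED.search(text).groups()
non_capturing_groups = NON_CAPTURING.search(text).groups()

print(nested_groups)         # domain ends up in group 5
print(simplified_groups)     # domain ends up in group 3
print(non_capturing_groups)  # only username and domain are captured
```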
3limin4t0r