How to build this regex?

Question

I want to match the email addresses in the following lines (the ones marked "match"):

uma@cs.stanford.edu - match
uma at cs.Stanford.edu - match
http://infolab.stanford.edu/~widom/yearoff.h
we
genale.stanford.edu
n <A href="mailto:cheriton@cs.stanford.edu - match
hola   @  kirti.edu - match

I want to capture only two parts of each email address, for example (uma) and (cs.stanford) from uma@cs.stanford.edu.

My current pattern is :

(\w+)[(\s+at\s+)|(\s*@\s*)]+(\w+|\w+\.\w+).edu

But it also matches the string infolab.stanford.edu, which I don't want. Can anybody suggest a modification?

What do you want matched out of the `mailto:` line? Which dialect of regex are you using — what's the host language? The answers will differ between JavaScript, Python, C++, Ruby, C, Perl, Java, various dialects of SQL and PHP, to name but a few of the many possibilities. And for C, there are multiple possible regex packages, such as PCRE, or POSIX, or HS, or ... — Jonathan Leffler, Oct 25 '15 at 05:35
Note that the square brackets form a funny character class in your regex. You use round brackets (parentheses) to enclose alternatives, not square brackets. — Jonathan Leffler, Oct 25 '15 at 05:36
@JonathanLeffler: I used parentheses, but that captures the 'at' or @, which I don't need. Is there any way I can group without capturing? — Surjya Narayana Padhi, Oct 25 '15 at 05:51
Since you've not identified the dialect of regex you're using, I don't know. It matters; PCRE (Perl compatible regular expressions) have ways of suppressing captures, but many other regex packages don't. I'm far from convinced you need the parentheses around `(\s+at\s+)` or `(\s*@\s*)`, so that capturing should be immaterial. Note that the real regex for matching email addresses is about a mile long. See [Using a regular expression to validate an email address](https://stackoverflow.com/questions/201323)! Note the third answer. — Jonathan Leffler, Oct 25 '15 at 05:53
Your character class is wrong, or probably shouldn't be there at all. `[(\s+at\s+)|(\s*@\s*)]` is equivalent to `[+*|()at@\s]`, i.e. it matches any one of the characters between the square brackets. — Bohemian, May 02 '23 at 10:09
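
The point made in the comments about the character class can be checked directly. The following is a minimal sketch, assuming Python's standard `re` module (the asker mentions Python further down); the names `CLASS_RE`, `ORIGINAL_RE`, `single_chars` and `url_match` are illustrative only.

```python
import re

# The bracketed "group" from the question, on its own: it is a character class.
CLASS_RE = re.compile(r"[(\s+at\s+)|(\s*@\s*)]")

# Each character listed between the brackets matches individually.
single_chars = [c for c in "()+*|at@ \t" if CLASS_RE.fullmatch(c)]

# The full pattern from the question, unchanged.
ORIGINAL_RE = re.compile(r"(\w+)[(\s+at\s+)|(\s*@\s*)]+(\w+|\w+\.\w+).edu")

# A lone "a" or "t" satisfies the class, so the URL line matches too.
url_match = ORIGINAL_RE.search("http://infolab.stanford.edu/~widom/yearoff.h")

print(single_chars)
print(url_match.group(0), url_match.groups())
```

Here the class consumes the single letter a in infolab, so group 1 comes out as infol and group 2 as b.stanford, which is why the URL line is matched even though it contains no @ and no standalone "at".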

score 0 · Answer 1 · edited May 02 '23 at 10:00

As long as you understand that this regex doesn't verify the correctness of your email address, but merely acts as a quick first line of defense against malformed addresses, an easy fix to your regex is as follows:

([\w.]+)(?:\s+at\s+|\s*@\s*)(\w+|\w+\.\w+)\.edu

In particular, your regex missed addresses whose usernames contain a . (as my main email address does), and its middle part was broken: the square brackets turned it into a character class, which was then allowed to repeat.
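
To see how this behaves on the sample lines from the question, here is a small sketch using Python's `re` module. It uses the pattern with the dot before `edu` escaped, so that only a literal dot is accepted there; `EMAIL_RE`, `SAMPLES` and `extract` are hypothetical names chosen for the illustration.

```python
import re

# Username, then " at " or "@" in a non-capturing group, then the domain
# without its ".edu" suffix. The dot before "edu" is escaped here.
EMAIL_RE = re.compile(r"([\w.]+)(?:\s+at\s+|\s*@\s*)(\w+|\w+\.\w+)\.edu")

SAMPLES = [
    "uma@cs.stanford.edu - match",
    "uma at cs.Stanford.edu - match",
    "http://infolab.stanford.edu/~widom/yearoff.h",
    "we",
    "genale.stanford.edu",
    'n <A href="mailto:cheriton@cs.stanford.edu - match',
    "hola   @  kirti.edu - match",
]


def extract(line):
    """Return (username, domain) for the first address found, else None."""
    m = EMAIL_RE.search(line)
    return m.groups() if m else None


for sample in SAMPLES:
    print(repr(sample), "->", extract(sample))
```

Because the separator is wrapped in `(?:...)`, only the username and the domain end up in the groups, and the three lines without an @ or a standalone "at" produce no match.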

answered Oct 25 '15 at 06:14 by Blindy · edited May 02 '23 at 10:00 by greg-449

I am using regex in python. The expression provided captures the 'at' and '@' which I don't need. – Surjya Narayana Padhi Oct 25 '15 at 06:21
Updated, but keep in mind you were capturing them too. Your syntax was just so wrong it didn't even process it. – Blindy Oct 25 '15 at 19:06

Bohemian · Answer 2 · 2023-05-02T11:18:39.433

This works for your sample input:

(\w+) *(?:@|\bat\b) *(\S+)

See live demo.

Regex breakdown:

`(\w+)` one or more word chars
` *` zero or more spaces (could use `\s*`, but no need)
`(?:...)` non-capturing group (to leave the username in group 1 and the domain in group 2)
`@|\bat\b` either @ or at as a solitary word (`\b` means word boundary), so it doesn't match the "at" in "match"
`(\S+)` one or more non-whitespace chars

This assumes usernames consist only of word chars (letters, digits and the underscore). To work more generally, add dots and dashes:

([\w.-]+) *(?:@|\bat\b) *(\S+)
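
As a rough check of the pattern above, the following sketch runs it with Python's `re` module; `LOOSE_RE` and `split_address` are hypothetical names used only for this illustration. Note that group 2 is `(\S+)`, so it holds the whole domain including the `.edu` suffix.

```python
import re

# Username (word chars, dots, dashes), optional spaces, "@" or a standalone
# "at", optional spaces, then everything up to the next whitespace.
LOOSE_RE = re.compile(r"([\w.-]+) *(?:@|\bat\b) *(\S+)")


def split_address(line):
    """Return (username, domain) for the first address found, else None."""
    m = LOOSE_RE.search(line)
    return m.groups() if m else None


print(split_address("uma@cs.stanford.edu - match"))
print(split_address("uma at cs.Stanford.edu - match"))
print(split_address("hola   @  kirti.edu - match"))
print(split_address("genale.stanford.edu"))
# The word boundaries keep an "at" buried inside a word from matching.
print(split_address("format string"))
```

If only the part before `.edu` is needed, as in the question, the suffix would still have to be stripped from group 2 afterwards or written into the pattern.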

score 0 · Answer 3 · answered May 02 '23 at 10:26

The main issue with your current regex is that you try to use [ ] as a group. Square brackets indicate a character class, not a group. If you swap these out for ( ) you'll notice that your regex matches the desired result.

This results in the regex:

(\w+)((\s+at\s+)|(\s*@\s*))+(\w+|\w+\.\w+).edu

Optionally you could choose to remove some unnecessary groups:

(\w+)(\s+at\s+|\s*@\s*)+(\w+|\w+\.\w+).edu
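
The asker's comments mention not wanting the 'at' or @ captured. With plain parentheses the separator becomes a group of its own; Python's `re` module supports `(?:...)` to group without capturing. The sketch below compares the two; `CAPTURING`, `NON_CAPTURING` and the other names are illustrative only, and the patterns are otherwise left as written above.

```python
import re

# The shorter pattern above: the separator sits in a capturing group.
CAPTURING = re.compile(r"(\w+)(\s+at\s+|\s*@\s*)+(\w+|\w+\.\w+).edu")

# Same pattern, but (?:...) groups the alternatives without capturing them.
NON_CAPTURING = re.compile(r"(\w+)(?:\s+at\s+|\s*@\s*)+(\w+|\w+\.\w+).edu")

line = "uma at cs.Stanford.edu - match"

with_separator = CAPTURING.search(line).groups()
without_separator = NON_CAPTURING.search(line).groups()

print(with_separator)     # three groups, the separator is the middle one
print(without_separator)  # two groups: username and domain
```

With the non-capturing form, the username stays in group 1 and the domain moves to group 2, which is the numbering the question asks for.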
