16

I want to extract email addresses from a web page. I have the page source and I am reading it line by line. For each line, I want to pull out any email address it contains; a given line may or may not contain one. I have seen many regexp examples, but most of them validate an email address. I want to extract addresses from the page source, not validate them. It should work the way http://emailx.discoveryvip.com/ does.

Some example input lines:

1)<p>Send details to <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;%72%65%62%65%6b%61%68@%68%61%63%6b%73%75%72%66%65%72.%63%6f%6d">neeraj@yopmail.com</a></p>

2)<p>Interested should send details directly to <a href="http://www.abcdef.com/abcdef/">www.abcdef.com/abcdef/</a>. Should you have any questions, please email <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;%6a%6f%62%73@%72%65%6c%61%79.%65%64%75">neeraj@yopmail.com</a>.

3)Note :- Send your queries at  neeraj@yopmail.com  for more details call Mr. neeraj 012345678901.

I want to get neeraj@yopmail.com from examples 1, 2 and 3. I am using Java and I am not good with regexp. Please help.

Neeraj
  • Did you check what google says about "java regex email"? – Vitaly Apr 17 '13 at 07:14
  • check the page source of the http://emailx.discoveryvip.com/. They have given the method to extract email. But i want a java version – Neeraj Apr 17 '13 at 07:15
  • What have you tried? Stack Overflow is a Q&A site, not a "do my work for me" site. Show us what you have so we can assist you with your specific problem. – Jared Ng Apr 17 '13 at 07:15
  • @Vitaly Yes. The relevant one was http://stackoverflow.com/questions/2250820/java-email-extraction-regular-expression. But it is not working. – Neeraj Apr 17 '13 at 07:18
  • @Neeraj, as you already know how to validate email, you can just do one more step further, capture the matched group, the data in group is exactly what you want. – hiway Apr 17 '13 at 07:19
  • @JaredNg I said that I want some help with the regexp part. I have given the input Strings. I just want the regexp. – Neeraj Apr 17 '13 at 07:20
  • @Neeraj There are a lot of samples of email parsing. If something is not working then show us how you tried it. That's simple. Now it looks like a job description :) – Vitaly Apr 17 '13 at 07:20
  • @jamp : No. I am experimenting something... :) – Neeraj Apr 17 '13 at 07:21
  • @HiwayChe No, the validation will not work for a whole String that contains an email. My regexp only tells whether the input string is an email or not. If you give it the whole line, it will always return false. – Neeraj Apr 17 '13 at 07:24
  • @Neeraj: There's a big difference between "Here are my inputs, write a regex for me" and "Here's the regex I wrote but it's not working on this particular input, can you explain why?" – Jared Ng Apr 17 '13 at 07:27

4 Answers

18

You can validate e-mail address formats according to RFC 2822 with this:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

and here's an explanation from regular-expressions.info:

This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the part before the @: it can either consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with backslashes.

And you can check this out here: Rubular example.

Community
14

The correct code is

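import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
    Pattern.CASE_INSENSITIVE);
Matcher matcher = p.matcher(input);
Set<String> emails = new HashSet<String>();
// find() scans the whole input, so every address in the text is collected.
while (matcher.find()) {
  emails.add(matcher.group());
}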

This collects every email address found in your long text / HTML input into a set.
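As a rough check against the inputs from the question, here is a minimal self-contained sketch that wraps the same pattern in a method and runs it on examples 1 and 3. The class name `EmailExtractDemo` and the method name `extract` are made up for illustration; a `LinkedHashSet` is used only so the output order is predictable.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractDemo {
    // Same pattern as in the answer above, compiled once and reused.
    private static final Pattern EMAIL = Pattern.compile(
            "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the distinct addresses found in one line.
    static Set<String> extract(String line) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(line);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        // Example 1 from the question: the href is entity/percent-encoded.
        String html = "<p>Send details to <a href=\"&#109;&#97;&#105;&#108;&#116;&#111;&#58;"
                + "%72%65%62%65%6b%61%68@%68%61%63%6b%73%75%72%66%65%72.%63%6f%6d\">"
                + "neeraj@yopmail.com</a></p>";
        // Example 3 from the question: plain text.
        String plain = "Note :- Send your queries at  neeraj@yopmail.com  for more details"
                + " call Mr. neeraj 012345678901.";
        System.out.println(extract(html));  // [neeraj@yopmail.com]
        System.out.println(extract(plain)); // [neeraj@yopmail.com]
    }
}
```

The encoded `href` in example 1 is not picked up because `%` is not allowed in the domain part of this pattern, so only the visible link text matches.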

arulraj.net
  • This does not take into account domain names which have more than two parts, for example in UK you have addresses like something@company.co.uk. Also nowadays you have bunch of new TLDs that are longer than 4 characters. – Juha Palomäki Mar 11 '17 at 13:14
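Regarding the comment above about longer TLDs, one possible adjustment is to drop the upper bound on the last label. The sketch below assumes that any alphabetic TLD of two or more letters is acceptable; the class name `RelaxedTldExtractor` is hypothetical. As far as I can tell, multi-part domains such as `company.co.uk` already match the original pattern, because the domain character class allows dots, so only the TLD length needs changing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelaxedTldExtractor {
    // Assumption: {2,} instead of {2,4}, so TLDs like "museum" are accepted.
    private static final Pattern EMAIL = Pattern.compile(
            "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns every match in order of appearance (duplicates are kept).
    static List<String> extract(String text) {
        List<String> emails = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        System.out.println(
                extract("Write to info@example.museum or something@company.co.uk."));
        // [info@example.museum, something@company.co.uk]
    }
}
```

The trade-off is that a looser TLD rule may also accept a few more false positives in arbitrary page text.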
4

You need something like this regex:

".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*"

When it matches, you can extract the first group and that will be your email.

String regex = ".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("your text here");
if (m.matches()) {
    String email = m.group(1);
    // do something with your email
}
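One caveat with this approach, from tracing the pattern by hand: because the leading `.*` is greedy and `matches()` must consume the whole line, group 1 ends up holding only the last address on the line, and a local part containing a dot can be cut at the dot. The sketch below compares it with a `find()` loop over the same core pattern. The class and method names (`FindVersusMatches`, `viaMatches`, `viaFind`) are invented for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVersusMatches {
    // Core address pattern shared by both variants.
    static final String CORE = "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b";

    // Whole-line match with a capturing group, as in the answer above.
    static String viaMatches(String line) {
        Matcher m = Pattern.compile(".*(" + CORE + ").*", Pattern.CASE_INSENSITIVE)
                .matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // Scan the line with find() and collect each full match.
    static List<String> viaFind(String line) {
        List<String> emails = new ArrayList<>();
        Matcher m = Pattern.compile(CORE, Pattern.CASE_INSENSITIVE).matcher(line);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        String line = "mail john.doe@example.com now";
        System.out.println(viaMatches(line)); // doe@example.com
        System.out.println(viaFind(line));    // [john.doe@example.com]
    }
}
```

For a line with a single simple address such as the ones in the question, both variants return the same result; the difference only shows up with dotted local parts or several addresses per line.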
Sahil Chhabra
Petar Ivanov
2

This is a simple way to extract all emails from an input String using Patterns.EMAIL_ADDRESS (from android.util.Patterns, so it is available on Android only):

    public static List<String> getEmails(@NonNull String input) {
        List<String> emails = new ArrayList<>();
        Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
        while (matcher.find()) {
            int matchStart = matcher.start(0);
            int matchEnd = matcher.end(0);
            emails.add(input.substring(matchStart, matchEnd));
        }
        return emails;
    }
Duy Pham