2

I want to apply the following regex to a string. It runs fine with Grant Skinner's RegExr, and it also runs fine on http://www.regexplanet.com/advanced/java/index.html (case-sensitive set), but Java just won't swallow it. It never enters the while loop. Here's my code:

public static void main(String args[]) {
   final String testString =
      "lorem upsadsad asda 12esadas test@test.com asdlawaljkads test[at]test" +
      "[dot]com test jasdsa meter";
   final Pattern ptr =
      Pattern.compile(
         "^[A-Z0-9\\._%+-]+(@|\\s*\\[\\s*at\\s*\\]\\s*)[A-Z0-9\\.-]+" +
         "(\\.|\\s*\\[\\s*dot\\s*\\]\\s*)[a-z]{2,6}$",
         Pattern.CASE_INSENSITIVE);

    try {
        final Matcher mat = ptr.matcher(testString);
        while (mat.find()) {
            final String group1 = mat.group(1);
            System.out.println(group1);
            final String group2 = mat.group(2);
            System.out.println(group2);
            final String group3 = mat.group(3);
            System.out.println(group3);
        }
    } catch (final Exception e) {
        e.printStackTrace();
    }
}
Aubin
user2945856
  • What is your regex/code supposed to do? Also, I just tested your regex in regexplanet and it doesn't match your string or find any substring. – Pshemo Nov 01 '13 at 17:50
  • Could you explain what that regex is doing? –  Nov 01 '13 at 17:50
  • The regex/string provided doesn't work for me when I use regexpal – user2202911 Nov 01 '13 at 17:51
  • The regex is trying to find all the email addresses in the string, including those where @ and . are written as [at] and [dot] – atomman Nov 01 '13 at 17:51
  • My (bad) regex skills are rusty. What eats the whitespace in " test jasdsa meter" after "lorem upsadsad asda 12esadas test@test.com asdlawaljkads test[at]test [dot]com" is consumed? – Lan Nov 01 '13 at 18:05
  • Though it's been fixed in a few answers, I think it bears stating: it looks like your main problem is simply the `^` and `$` at the beginning and end of the regex. You appear to be trying to match e-mails embedded in a line, rather than as an entire line, so the beginning and end of line markers are out of place here. Any other cleanup, redesign, or fixing of capturing groups aside, the regex appears to match the intended parts of the string once those are removed. – femtoRgon Nov 01 '13 at 18:18
  • First off, thanks for all the comments. I will have a look at @femtoRgon's suggestion now. – user2945856 Nov 01 '13 at 18:26
  • @atomman you are right, of course. This is way simpler than anything else. – user2945856 Nov 01 '13 at 18:28

3 Answers

2

There's no need for the complicated regex. As another user suggested, replace "[dot]" with "." and "[at]" with "@", i.e.:

myAddressLine = myAddressLine.replace("[dot]", ".").replace("[at]","@");

Now, we can simplify your regex to:

Pattern.compile(
"\\b([a-z0-9._%+-]+)@([a-z0-9.-]+)\\.([a-z]{2,6})\\b", Pattern.CASE_INSENSITIVE);

\\b is a word boundary, which is what you want here, not "^" and "$", which anchor the match to the beginning and end of the input, respectively.

Notice that my capturing groups are different than yours. Before, you were capturing the "@" and "[dot]" and such. Now the "username", "domain", and the "top level domain" are being captured, which is what I assume that you want.
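
Putting the two steps together, a minimal sketch might look like the following. The class name `EmailFinder` and the helper `findEmails` are made up for illustration; the pattern and the `replace` calls are the ones shown above, and the test string is the one from the question. Note that `replace` only handles the exact spellings "[at]" and "[dot]", not variants with spaces inside or around the brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailFinder {
    // Simplified pattern from above: groups 1-3 are user, domain, and top-level domain.
    private static final Pattern EMAIL = Pattern.compile(
        "\\b([a-z0-9._%+-]+)@([a-z0-9.-]+)\\.([a-z]{2,6})\\b",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: normalizes "[at]"/"[dot]", then collects the groups of every match.
    static List<String[]> findEmails(String text) {
        // String.replace treats its arguments literally, so the brackets need no escaping.
        String normalized = text.replace("[dot]", ".").replace("[at]", "@");
        List<String[]> result = new ArrayList<>();
        Matcher m = EMAIL.matcher(normalized);
        while (m.find()) { // find() scans for matches anywhere in the input
            result.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return result;
    }

    public static void main(String[] args) {
        String testString =
            "lorem upsadsad asda 12esadas test@test.com asdlawaljkads test[at]test"
            + "[dot]com test jasdsa meter";
        for (String[] parts : findEmails(testString)) {
            System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        }
    }
}
```

With the question's test string this should print `test | test | com` twice, once for each address.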

NB: you don't need to escape most special characters inside character classes, i.e. [.] represents a period, so [\\.] is unnecessary. It still works fine, as you would need \\\\ to actually match a \, which is explained here.
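
A quick way to check the escaping behaviour described in this note is a small sketch like the one below. The class `CharClassEscapes` and its helper `fullMatch` are hypothetical names used only for this illustration; `Pattern.matches` is the standard library call.

```java
import java.util.regex.Pattern;

final class CharClassEscapes {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[.]", "."));     // true: a dot inside a class is literal
        System.out.println(fullMatch("[.]", "a"));     // false: it is not the "any character" dot
        System.out.println(fullMatch("[\\.]", "."));   // true: the escaped form behaves the same
        System.out.println(fullMatch("[\\\\]", "\\")); // true: a literal backslash needs \\\\ in Java source
    }
}
```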

Community
Steve P.
  • You are right. I must have been blind not to see this obvious solution. Since this does not reflect my original question, I am not sure if I should mark your answer as correct. However, this will do just fine. – user2945856 Nov 01 '13 at 18:29
  • @user2945856 I mean, you don't have to mark it correct, but it's a good way to do it and it works, so it's acceptable to do so. Correct doesn't mean that something had to answer your question in a very specific way. Hell, it doesn't even need to be correct. If someone gives a different answer than you're expecting, but you approve, then it's perfectly okay to accept it. If you want to, you can knowingly accept a bad answer. People may get upset, but it's all up to you. – Steve P. Nov 01 '13 at 18:33
0
final Pattern ptr = Pattern.compile(
    "\\b([A-Z0-9\\._%+-]+)"+
    "(?:@|\\s*\\[\\s*at\\s*\\]\\s*)"+
    "([A-Z0-9\\.-]+)"+
    "(?:\\.|\\s*\\[\\s*dot\\s*\\]\\s*)"+
    "([a-z]{2,6})\\b", Pattern.CASE_INSENSITIVE);
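
As a rough sketch of how this pattern could be used with `find()`, assuming the three groups are meant to be user, domain, and top-level domain (the class `ObfuscatedEmailFinder` and the helper `findAddresses` are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ObfuscatedEmailFinder {
    // Same pattern as above; the separators are non-capturing, so only
    // user, domain, and top-level domain end up in groups 1-3.
    private static final Pattern PTR = Pattern.compile(
        "\\b([A-Z0-9\\._%+-]+)" +
        "(?:@|\\s*\\[\\s*at\\s*\\]\\s*)" +
        "([A-Z0-9\\.-]+)" +
        "(?:\\.|\\s*\\[\\s*dot\\s*\\]\\s*)" +
        "([a-z]{2,6})\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rebuilds each match as user@domain.tld.
    static List<String> findAddresses(String text) {
        List<String> found = new ArrayList<>();
        Matcher mat = PTR.matcher(text);
        while (mat.find()) {
            found.add(mat.group(1) + "@" + mat.group(2) + "." + mat.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        String testString =
            "lorem upsadsad asda 12esadas test@test.com asdlawaljkads test[at]test"
            + "[dot]com test jasdsa meter";
        System.out.println(findAddresses(testString));
    }
}
```

Unlike the replace-first approach, this version also accepts whitespace around the bracketed forms, e.g. `john [at] example [dot] org`.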
pobrelkey
  • Why have you changed the grouping from the original regex? – atomman Nov 01 '13 at 18:04
  • The Java code after the regex in the OP is looking for three groups. It's obviously trying to match e-mail addresses in the format user-domain-TLD, so I presumed the groups should correspond to those three parts of the address. – pobrelkey Nov 01 '13 at 18:07
  • I believe you are correct, and I had the same confusion regarding the groups, but I feel like these kinds of assumptions should be mentioned in your answer. – atomman Nov 01 '13 at 18:09
  • So many correct answers, I don't know which one to select as correct now. Thank you very much for the reply and another solid solution. – user2945856 Nov 01 '13 at 18:30
0

To simplify your regex, I would replace the [at] and [dot] with the actual characters first. Then just use a standard email regex such as:

matches("(?i)\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b");
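
One thing to keep in mind with this regex: `String.matches` only succeeds when the entire input matches, so it suits validating a single address rather than searching a longer line. The sketch below contrasts it with `Matcher.find()`; the class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // The regex from above, unchanged.
    static final String EMAIL_REGEX =
        "(?i)\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b";

    // True only if the entire input is one address.
    static boolean wholeInputIsEmail(String input) {
        return input.matches(EMAIL_REGEX);
    }

    // True if an address occurs anywhere in the input.
    static boolean containsEmail(String input) {
        return Pattern.compile(EMAIL_REGEX).matcher(input).find();
    }

    public static void main(String[] args) {
        String line = "contact test@test.com today";
        System.out.println(wholeInputIsEmail(line));            // false
        System.out.println(containsEmail(line));                // true
        System.out.println(wholeInputIsEmail("test@test.com")); // true
    }
}
```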
EdgeCase
  • He would still need to use `find` not `matches`. `matches` implicitly adds `^` and `$` as explained [here](http://stackoverflow.com/questions/4450045/difference-between-matches-and-find-in-java-regex) – atomman Nov 01 '13 at 18:07
  • However, the approach is correct. Simply replacing the [dot]'s and [at]'s will do just fine. Thanks for the reply – user2945856 Nov 01 '13 at 18:31