
I have written a regular expression in Java that validates a given address and then creates groups that separate out the street number and name, city, state, and zip code.

My code is as follows:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String address = "1600 Pennsylvania Ave NW, Washington, DC 20500";
    String regex = "(\\s*\\d*\\s*,?\\s*(\\w*\\s*)+),?\\s*(\\w*\\s*)+\\s*,?\\s*(\\w{2})?\\s*,?\\s*(\\d{5})?\\s*";

    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(address);
    if (matcher.matches()) {
        int groupCount = matcher.groupCount();
        System.out.println(groupCount);
        // Group 0 is the whole match; groups 1..groupCount are the capture groups.
        for (int i = 0; i <= groupCount; i++) {
            String group = matcher.group(i);
            System.out.println(group);
        }
    } else {
        System.out.println("Does not match");
    }

The output of the code is as follows:

5
1600 Pennsylvania Ave NW, Washington, DC 20500
1600 Pennsylvania Ave NW


DC
20500

I understand that the second line of the output is group 0, which is the entire string itself as per the Javadocs. What I am not able to understand is why "Washington" is not printed. Two blank lines are printed instead.

Can someone please explain to me what is wrong here?

Some more information: I expect that the user may or may not put a comma (,) in the address string, and may put multiple spaces between two words. The state will always be a two-letter state code.

Thanks, Raj

Raj
  • http://stackoverflow.com/questions/6939526/java-regex-repeating-capturing-groups and pointers therein – NPE Sep 25 '14 at 18:01

2 Answers


The problem is in the regex itself. You are using nested groups, and every pair of parentheses counts as a capturing group, so the group numbers no longer line up with the fields you expect. To solve this, you can use '?:' to mark what should not be considered a group: ([\d]+) creates a group for the matcher, but (?:[\d]+) doesn't.

In any case, I think your regex could be a little better. Check this one and see if it fits your needs:
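
To illustrate the difference between capturing and non-capturing groups, here is a minimal sketch. The class name GroupCountDemo and the helper countGroups are made up for this example, and the two short patterns are illustrative, not taken from the question.

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups the pattern defines (group 0 is not counted).
    // countGroups is a hypothetical helper written for this illustration.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Nested capturing group: the inner (\d+) takes group number 2.
        System.out.println(countGroups("((\\d+)\\s*)"));   // prints 2
        // Same shape with a non-capturing inner group: only group 1 remains.
        System.out.println(countGroups("((?:\\d+)\\s*)")); // prints 1
    }
}
```

Groups are numbered by the position of their opening parenthesis, so removing a capturing group shifts the numbers of every group that opens after it.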

"([\\d]+)?(?:\\s?([^,]+)\\,)?(?:\\s?([^,]+)\\,)?(?:\\s?([\\w]{2}))(?:\\s?([\\d]{5}))"

or

 "([\\d]+)?(?:\\s?([\\w\\s]+)\\,)?(?:\\s?([\\w\\s]+)\\,)?(?:\\s?([\\w]{2}))(?:\\s?([\\d]{5}))"
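
As a quick check, the first pattern above can be run against the address from the question. This is only a sketch: the class name SuggestedRegexDemo and the helper parse are invented for the example, and it assumes the whole input must match (Matcher.matches()).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SuggestedRegexDemo {
    // First pattern from the answer, copied verbatim.
    static final Pattern ADDRESS = Pattern.compile(
        "([\\d]+)?(?:\\s?([^,]+)\\,)?(?:\\s?([^,]+)\\,)?(?:\\s?([\\w]{2}))(?:\\s?([\\d]{5}))");

    // Hypothetical helper: returns groups 1..5, or null when the input does not match.
    static String[] parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            parts[i - 1] = m.group(i);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = parse("1600 Pennsylvania Ave NW, Washington, DC 20500");
        System.out.println(parts == null ? "no match" : String.join(" | ", parts));
        // prints: 1600 | Pennsylvania Ave NW | Washington | DC | 20500
    }
}
```

Note that this pattern relies on the commas to separate the fields, so an address typed without any commas does not match under matches().
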
Bruno Queiroz

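The reason you don't see results in some capture groups is that you are overwriting them: a capture group under a quantifier keeps only what its last repetition matched.

For example, (\d?)+ will likely follow this sequence: match nothing -> match digit -> match nothing, so the group ends up empty.

You have to wrap the repetition in a capture group, like this: ( (?:\d?)+ )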
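
The overwriting can be seen in a small sketch that uses the street sub-pattern from the question. The class name RepeatedGroupDemo and the helper groupOne are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Hypothetical helper: returns group 1 after a full match, or null if there is no match.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String street = "Pennsylvania Ave NW";
        // Quantifier applied to the capturing group: only the last repetition is
        // kept, and here the last repetition matches the empty string.
        System.out.println("[" + groupOne("(\\w*\\s*)+", street) + "]");     // prints []
        // Repetition wrapped in a capturing group: the whole run is kept.
        System.out.println("[" + groupOne("((?:\\w*\\s*)+)", street) + "]"); // prints [Pennsylvania Ave NW]
    }
}
```

This is the same effect that produces the blank lines in the question's output.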

So after fixing your regex, it works out to be something like this:

 # "(\\s*(\\d*)\\s*,?\\s*((?:\\w*\\s*)+)),?\\s*((?:\\w*\\s*)+)\\s*,?\\s*(\\w{2})?\\s*,?\\s*(\\d{5})?\\s*"

 (                            # (1 start), Address
      \s* 
      ( \d* )                      # (2), Number
      \s* ,? \s* 
      (                            # (3 start), Street
           (?: \w* \s* )+
      )                            # (3 end)
 )                            # (1 end)
 ,? \s* 
 (                            # (4 start), City
      (?: \w* \s* )+
 )                            # (4 end)
 \s* ,? \s* 
 ( \w{2} )?                   # (5), State
 \s* ,? \s* 
 ( \d{5} )?                   # (6), Zip
 \s* 

Output:

 **  Grp 0 -  ( pos 0 , len 46 ) 
1600 Pennsylvania Ave NW, Washington, DC 20500  
 **  Grp 1 -  ( pos 0 , len 24 ) 
1600 Pennsylvania Ave NW  
 **  Grp 2 -  ( pos 0 , len 4 ) 
1600  
 **  Grp 3 -  ( pos 5 , len 19 ) 
Pennsylvania Ave NW  
 **  Grp 4 -  ( pos 26 , len 10 ) 
Washington  
 **  Grp 5 -  ( pos 38 , len 2 ) 
DC  
 **  Grp 6 -  ( pos 41 , len 5 ) 
20500
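
For completeness, here is a sketch of how the fixed regex could be plugged back into Java code like that in the question. The class name FixedAddressRegexDemo and the helper groups are assumptions made for this example; the pattern is the one shown above, split across string literals only for readability.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedAddressRegexDemo {
    // The fixed regex from this answer, concatenated from four pieces.
    static final Pattern ADDRESS = Pattern.compile(
        "(\\s*(\\d*)\\s*,?\\s*((?:\\w*\\s*)+))"   // (1) address, (2) number, (3) street
        + ",?\\s*((?:\\w*\\s*)+)"                 // (4) city
        + "\\s*,?\\s*(\\w{2})?"                   // (5) state
        + "\\s*,?\\s*(\\d{5})?\\s*");             // (6) zip

    // Hypothetical helper: returns groups 0..6, or null when the input does not match.
    static String[] groups(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("1600 Pennsylvania Ave NW, Washington, DC 20500");
        for (int i = 0; g != null && i < g.length; i++) {
            System.out.println("Grp " + i + ": " + g[i]);
        }
    }
}
```

One caveat: the street group is greedy, so if the commas are left out entirely it is likely to swallow the city, state, and zip as well. The optional-comma case probably needs a stricter pattern.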