I am trying to figure out the best way to split an address line into three fields for a file: house number, street name, and apartment number. Thankfully, the city, state, and zip are already in their own columns, so those three fields are all I have to parse out, but even that is proving difficult. My initial hope was to do this in COBOL using SQL, but I don't think I can use the PATINDEX example someone posted on a separate question thread; I kept getting SQL code -440. My second thought was to do this in Java, treating the strings as arrays and checking them for numbers, then letters, then comparing for "Apt" or something to that effect. This is what I have so far as a test of what I'm ultimately trying to do, but I am getting an out-of-bounds exception for the array.

class AddressTest{
    public static void main (String[] arguments){
       String adr1 = "100 village rest court";
       String adr2 = "1000 Arbor lane Apt. 21-D";
       String[] HouseNbr = new String[9];
       String[] Street = new String[20];
       String[] Apt = new String[5];

       for(int i = 0; i < adr1.length();i++){
           String[] forloop = new String[] {adr1};
           if (forloop[i].substring(0,1).matches("[0-9]")){
               if(forloop[i+1].substring(0,1).matches("[0-9]")){
                   HouseNbr[i] = forloop[i];
               }
               else if(forloop[i+1].substring(0,1).matches(" ")){
               }
               else if(forloop[i].substring(0,1).matches(" ")){
               }
               else{
                   Street[i] = forloop[i];
               }
           }
       }

       for(int j = 0; j < HouseNbr.length; j++){
               System.out.println(HouseNbr[j]);
       }
       for(int k = 0; k < Street.length; k++){
           System.out.println(Street[k]);
       }
    }   
}

Any other thoughts would be extremely helpful.

Bill Woodger
Aaron H
  • Possible duplication of http://stackoverflow.com/questions/3481828/how-to-split-a-string-in-java – azurefrog Apr 09 '14 at 20:51
  • A somewhat similar question was once answered with a very smart suggestion: cross-checking the candidate address against the Google Maps API. – Leo Apr 09 '14 at 21:00
  • @Leo, that is not a bad idea at all, assuming it is quick enough. user311530 I'm sure there will be paid-for services of various types as well. Why do you need to do that anyway? How was the data-entry done? Validated, or any-old-rubbish? If you have the Zip, do you need the streetname? (I don't know, not done US addresses). Before coding it out, research some other possibilities, if you need to code, first analyse all your addresses for this data - see what sort of percentage you can deal with. – Bill Woodger Apr 09 '14 at 21:08
  • Having spent 7 years working for a company that did this commercially (in the '80s) I can assert based on intimate experience that this problem has no complete solution. There will _always_ be addresses you parse wrong. The question you have to answer is "how much accuracy are you willing to pay for?". You can get to 90% pretty cheaply but from that point the cost in development time and special-case handling goes up exponentially. If you have to cope with foreign addresses you'll be developing logic for each country and/or region separately. – Jim Garrison Apr 09 '14 at 21:44
  • Thankfully no international addresses. I think based on all the suggestions I have a pretty good idea of how I want to attack it. – Aaron H Apr 10 '14 at 12:38
  • you could upvote for the help?... – Pat B Apr 10 '14 at 13:55
  • It says I have to have 15 reputation points first :-\. I will once I get that. When I try to use "import org.apache.commons.lang3.*;", it's telling me "Access restriction: The method isNumeric(CharSequence) from the type StringUtils is not accessible due to restriction on required library C:\Program Files\Java\jre6\lib\ext\commons-lang3-3.3.2.jar". Did I not bring in the jar correctly? – Aaron H Apr 10 '14 at 14:33
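
Following up on the suggestion above to analyse the addresses first and see what percentage can be dealt with: here is a minimal sketch of that kind of check. The class name, the SIMPLE pattern, and the four sample lines are all made up for illustration; in practice the lines would come from the real data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class AddressCoverageSketch {

    // Assumed "easy" shape: leading digits, whitespace, then at least one more character.
    static final Pattern SIMPLE = Pattern.compile("\\s*\\d+\\s+\\S.*");

    /** Returns the lines that do not fit the simple shape. */
    static List<String> unmatched(List<String> lines) {
        List<String> rest = new ArrayList<String>();
        for (String line : lines) {
            if (!SIMPLE.matcher(line).matches()) {
                rest.add(line);
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        // Hypothetical sample; replace with rows read from the database.
        List<String> sample = Arrays.asList(
                "100 village rest court",
                "1000 Arbor lane Apt. 21-D",
                "PO Box 12",
                "Rural Route 3");
        List<String> rest = unmatched(sample);
        int handled = sample.size() - rest.size();
        System.out.println(handled + " of " + sample.size() + " lines fit the simple pattern");
        System.out.println("Need a closer look: " + rest);
    }
}
```

Running something like this over the whole table before writing the parser shows how many lines need special handling and what they look like.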

3 Answers


I would consider removing the unnecessary arrays and using a StringTokenizer...

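import java.util.StringTokenizer;
import org.apache.commons.lang3.StringUtils;

public class TokenizerExample {

public static void main(String[] args) {

    String number = null;
    String address = null;   // not filled in here; the remaining tokens would form the street name
    String aptNumber = null;

    String str = "This is String , split by StringTokenizer";
    StringTokenizer st = new StringTokenizer(str);

    System.out.println("---- Split by space ------");
    while (st.hasMoreTokens()) {
        String s = st.nextToken();
        System.out.println(s);

        // An all-digit token is taken as the house number
        if (StringUtils.isNumeric(s)) {
            number = s;
            continue;
        }

        // indexOf returns -1 when "Apt" is absent, so test for >= 0
        if (s.indexOf("Apt") >= 0) {
            // Keeps the text from "Apt" onward; the unit itself is usually the next token
            aptNumber = s.substring(s.indexOf("Apt"));
            continue;
        }
    }

    System.out.println("---- Split by comma ',' ------");
    StringTokenizer st2 = new StringTokenizer(str, ",");

    while (st2.hasMoreTokens()) {
        System.out.println(st2.nextToken());
    }
}
}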
Pat B
  • And then what? How does this help in identifying "house number, street name, and apartment number"? – Bill Woodger Apr 09 '14 at 21:02
  • The string tokenizer doesn't have any ways to verify if the nextElement() is numeric or not does it? – Aaron H Apr 09 '14 at 21:05
  • Extracting addresses is not always easy... the first element should give you the door number, and the rest would go into the street name. if (st.nextToken().indexOf("Apt") >= 0) should indicate whether you have an apt so you can extract it from the address. – Pat B Apr 09 '14 at 21:09
  • Would example 2 from this one maybe do the trick if I modified it to do the substring(0,1).matches("[0-9]") after my SQL (instead of reading a file like in the example)? http://www.mkyong.com/java/java-stringtokenizer-example/ – Aaron H Apr 09 '14 at 21:11
  • you can surely do StringUtils.isNumeric using apache.commons – Pat B Apr 09 '14 at 21:11
  • I avoid doing unnecessary substrings and matches... it complicates the code and makes it less debuggable. – Pat B Apr 09 '14 at 21:13
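
On the question of whether the tokenizer can verify that a token is numeric: StringTokenizer only splits, so the check has to be done on each token afterwards. A minimal sketch of the idea discussed in these comments, using plain String.matches so no extra jar is needed. The class and method names are hypothetical, and it assumes the house number is the first token and the unit is the token right after "Apt" or "Apt.".

```java
import java.util.StringTokenizer;

class TokenAddressSketch {

    /** Returns {houseNumber, streetName, apartment}; a missing part is "". */
    static String[] split(String line) {
        StringTokenizer st = new StringTokenizer(line);
        String houseNumber = "";
        StringBuilder street = new StringBuilder();
        String apartment = "";

        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (houseNumber.isEmpty() && street.length() == 0 && token.matches("[0-9]+")) {
                // Assumption: a leading all-digit token is the house number.
                houseNumber = token;
            } else if (token.matches("(?i)apt\\.?")) {
                // Assumption: the unit is the token right after "Apt" / "Apt."
                if (st.hasMoreTokens()) {
                    apartment = st.nextToken();
                }
            } else {
                // Everything else is collected as the street name.
                if (street.length() > 0) {
                    street.append(' ');
                }
                street.append(token);
            }
        }
        return new String[] { houseNumber, street.toString(), apartment };
    }

    public static void main(String[] args) {
        String[] samples = { "100 village rest court", "1000 Arbor lane Apt. 21-D" };
        for (String adr : samples) {
            String[] parts = split(adr);
            System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        }
    }
}
```

Holding each field as one String instead of an array of single characters also removes the fixed array sizes, which is where the out-of-bounds risk comes from.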

If you leverage the freely available U.S. Postal Service zip code finder (https://tools.usps.com/go/ZipLookupAction!input.action), you can get back an address in standardized format. The valid options on that format are documented by the USPS and will make it easier to write a very complicated regex, or a number of simple regexes, to read the standard form.
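
As a rough illustration of the "number of simple regexes" idea, here is a sketch of one pattern for lines that are already in a regular "number street [designator unit]" shape. The class name is hypothetical, and the three designators (APT, UNIT, STE) are only an assumed starter set, not the full list the USPS documents.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexAddressSketch {

    // group 1: house number, group 2: street name, group 3: unit (optional)
    static final Pattern LINE = Pattern.compile(
            "^\\s*(\\d+)\\s+(.+?)(?:\\s+(?:APT|UNIT|STE)\\b\\.?\\s*(\\S+))?\\s*$",
            Pattern.CASE_INSENSITIVE);

    /** Returns {houseNumber, streetName, unit}, with unit null if absent, or null if the line does not fit. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // leave this line for manual review or another pattern
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("1000 Arbor lane Apt. 21-D");
        System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
    }
}
```

The word boundary after the designator keeps a street such as "Aptos" from being read as "Apt" plus a unit. Lines that return null are the ones that need a second pattern or a manual look.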

Joe Zitzelberger

I am still working on it, but for anyone in the future who may need to do this:

import java.util.Arrays;
import java.util.StringTokenizer;
import org.apache.commons.lang3.*;

class AddressTest{
public static void main (String[] arguments){
   String adr1 = "100 village rest court";
   //String adr2 = "1000 Arbor lane Apt. 21-D";
   String reader = new String();
   String holder = new String();
   StringTokenizer a1 = new StringTokenizer(adr1);
   String[] HouseNbr = new String[9];
   String[] StreetName = new String[20];
   String[] Apartment = new String[5];
   int counter = 0;

   while(a1.hasMoreElements()){
       reader = a1.nextElement().toString();
       System.out.println("Reader: " + reader);
       if(StringUtils.isNumeric(reader)){
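           // Copy the digits one at a time. charAt avoids split(""), whose
           // leading empty element is present before Java 8 but not after.
           for(int i = 0; i < reader.length() && i < HouseNbr.length;i++){
               System.out.println("HNBR:_" + reader.charAt(i));
               HouseNbr[i] = String.valueOf(reader.charAt(i));
           }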
       }
       else if(StringUtils.startsWith(reader, "Apt.")){
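           // The unit follows "Apt." as its own token, if there is one
           if(a1.hasMoreElements()){
               holder = a1.nextElement().toString();
               for(int j = 0; j < holder.length() && j < Apartment.length;j++){
                   Apartment[j] = String.valueOf(holder.charAt(j));
               }
           }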

       }
       else{
           // Append this word's characters until the street array is full
           for(int k = 0; k < reader.length();k++){
               if(counter == StreetName.length){
                   break;
               }
               StreetName[counter] = String.valueOf(reader.charAt(k));
               counter++;
           }
           if((counter < StreetName.length) && a1.hasMoreElements()){
               StreetName[counter] = " ";
               counter++;
           }
       }

   }
   System.out.println(Arrays.toString(HouseNbr) + " " + Arrays.toString(StreetName)                
       + " " + Arrays.toString(Apartment));
    }   
}
Aaron H
  • I think you're going to be making a lot of work for yourself and not getting such good results. Remember, unless your addresses have been normalised, you will get Apt, Appt, Art, Apat, Apartment, Aot, Spt, and many, many more such things, including variations of case and punctuation. – Bill Woodger Apr 17 '14 at 21:25
  • I checked our database and we don't have very many apartments at all... maybe 1%, if that. I will probably add another if statement for another variation, but for my purposes it won't be too hard to allow for them. – Aaron H Apr 18 '14 at 12:32
  • OK. I arranged for you to be able to vote already, anyway. Good luck. – Bill Woodger Apr 18 '14 at 14:43
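
For the variations mentioned in these comments, one option is to compare each token against a set of known spellings instead of adding one if statement per variant. A minimal sketch, where the class name and the starter list are hypothetical. Misspellings that are also real words ("Art", for example) cannot be caught safely this way.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UnitDesignatorSketch {

    // Hypothetical starter list; extend it with spellings found in the real data.
    static final Set<String> APT_VARIANTS = new HashSet<String>(
            Arrays.asList("APT", "APPT", "APAT", "APARTMENT"));

    static boolean isApartmentMarker(String token) {
        // Keep letters only, so "Apt." and "apt," both reduce to "APT".
        String letters = token.replaceAll("[^A-Za-z]", "").toUpperCase(Locale.ROOT);
        return APT_VARIANTS.contains(letters);
    }

    public static void main(String[] args) {
        String[] tokens = { "Apt.", "apartment", "lane", "Aptos" };
        for (String token : tokens) {
            System.out.println(token + " -> " + isApartmentMarker(token));
        }
    }
}
```

New spellings found in the data then only need to be added to the set.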