
I need to parse some legal documents to find the addresses inside them. Below is an example:

test = "9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005 nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse 124 some ave 12 st, some city, NY, 10005cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed125 some ave 12 st, some city, NY, 10005 do eiusmod tempor incididunt ut labore 126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

tmp = test.scan(/(\d{3,6})(.*?)(\d{5})/)
tmp.each do |t|
  puts t.join()
end

Normally, the addresses would start with a number and end with a zip code, but in these documents that is not always the case.

The problem is that I miss some addresses and get some unwanted results, like:

9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum 11111

What I would like is an array of the following 4 items:

123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK

As for the last item, I am pretty sure that all addresses formatted like this would end with either "New York" or "NY".

I think my target pattern is:

/(ANY DIGITS BETWEEN 3 AND 6)(AT LEAST 3 WORDS BUT NOT MORE THAN 10)((TRY FIRST ZIPCODE)|(IF NO ZIP CODE THEN TRY "NEW YORK" OR "NY"))/i

Any help would be greatly appreciated.

pcasa
  • See this question and the answer from Matt: http://stackoverflow.com/questions/9397485/regex-street-address-match – michaelmichael Oct 06 '13 at 03:16
  • Looked at SmartyStreets, awesome app, but it charges my account for more addresses than it finds, so parsing an actual document that should produce 3 valid addresses consumed 7 requests. At that rate, it would cost too much. Additionally, it misses addresses that look like "someText127 some street, some city, NY, 10005"; note there is no space between the text and the house number. – pcasa Oct 06 '13 at 03:32
  • Right. As the linked answer states, street addresses aren't a regular language. Reliably extracting street addresses from legal documents using just regular expressions is not feasible. It will be highly prone to errors. – michaelmichael Oct 06 '13 at 03:48

2 Answers


Here's what has worked for me for parsing info from legal texts:

  1. Break the complicated task down into simpler ones. Write a regex (or function using regexes) for each variation of addresses you want to capture.

  2. Write test cases for each variation. Here are a couple tests I wrote for a number parser as an example.

    test '554' do                                                                                   
      assert_equal 554, number_parser.parse('five hundred fifty-four')                              
    end                                                                                             

    test '1301' do                                                                                  
      assert_equal 1301, number_parser.parse('thirteen hundred one')                                
    end                                                                                             

  3. Since you know the range of some values, such as state names and state abbreviations, you can incorporate that knowledge into your functions to parse for the variations.
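
As a rough illustration of steps 1 and 3 applied to the addresses in the question, here is a minimal sketch with one regex per variation and the known state values built in. All constant and method names are hypothetical, and the word limits (2 to 10 words after the house number) are assumptions taken from the target pattern described in the question, not a tested rule.

```ruby
# Hypothetical building blocks; none of these names come from the original post.
HOUSE_NUMBER = /\d{3,6}/
# Between 2 and 10 whitespace-separated words, as few as possible (lazy).
WORDS = /(?:\s+\S+){2,10}?/
# Known state values (step 3); extend this list for other states.
STATE = /(?:NEW YORK|NY)/i

# Variation 1: house number ... state, then a five-digit zip code.
WITH_ZIP = /#{HOUSE_NUMBER}#{WORDS},?\s+#{STATE},?\s+\d{5}/
# Variation 2: house number ... state name, with no zip code.
WITHOUT_ZIP = /#{HOUSE_NUMBER}#{WORDS},?\s+#{STATE}\b/

# Tries the zip variation first, then falls back to the state-only variation.
# No capture groups are used, so scan returns whole matches as strings.
def extract_addresses(text)
  text.scan(Regexp.union(WITH_ZIP, WITHOUT_ZIP))
end
```

Each variation can then get its own test case, in the same style as the number parser tests above. This is still a heuristic: it will miss addresses whose state is misspelled or that run longer than ten words.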
Dogweather
  • Writing a function to parse information is correct, but that is what I am in the process of creating. Based on your answer, I was able to achieve an acceptable workflow to get me what I needed so I did up-tick your answer and will provide what exactly I did. – pcasa Oct 07 '13 at 12:01

As michaelmichael and stackoverflow.com/questions/9397485/regex-street-address-match stated, there is really no way to reliably scan for addresses, much less when the documents contain a tremendous number of typos, as the original example shows.

So I broke it into 2 parts.

First, a function that scans for patterns that resemble an address.

# First scan for possible addresses
def look_for_address_patterns(txt)
  # Group 1: a number that is 2-6 digits long (similar to a house number).
  # Group 2: the next 1-15 whitespace-separated items; the inner group is
  #          non-capturing so that join does not repeat the last item.
  # Group 3: either 5 digits (zip code) or the state name / abbreviation.
  pattern = /(\d{2,6})(\s*(?:\S+\s+){1,15})(\d{5}|NEW YORK|NY)/
  # scan returns one array of captured groups per match; join rebuilds the text
  resp = txt.scan(pattern).map(&:join)
  # Go to step 2 for verifying address before returning anything
  verify_address(resp)
end

Second, a function that uses a service like Google, MapQuest or Yahoo to verify the addresses.

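require 'json'
require 'open-uri'

def verify_address(arry)
  verified = []
  arry.each do |addr|
    # escape spaces and commas so the address is a valid query parameter
    # (the current Geocoding API also requires HTTPS and an API key)
    url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + URI.encode_www_form_component(addr)
    response = JSON.parse(URI.open(url).read)
    # 'results' is an array; take the first match and skip addresses with none
    result = response['results'].first
    next unless result
    formatted = result['formatted_address']
    # compare that we got something similar in address response, remove SW and from Lane to ln is ok, but anything else is probably a different address
    matched = addr.downcase[0..8] == formatted.downcase[0..8]
    # should be storing more info like lat / lng but that is for a later project
    verified.push(formatted) if matched
  end
  verified
end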
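
The code comment above says that dropping "SW" or going from "Lane" to "ln" is acceptable, but a plain comparison of the first nine characters does not allow for either. One possible refinement, sketched here with hypothetical helper names and a deliberately tiny abbreviation table, is to normalize both strings before comparing their leading words. This is an assumption about what the comparison is meant to tolerate, not part of the workflow that produced the results below.

```ruby
# Hypothetical, far-from-complete tables used only for illustration.
ABBREVIATIONS = {
  'street' => 'st', 'avenue' => 'ave', 'lane' => 'ln', 'road' => 'rd'
}.freeze
DIRECTIONALS = %w[n s e w ne nw se sw].freeze

# Lowercases, strips punctuation, drops directionals and abbreviates
# common street suffixes so that equivalent spellings compare equal.
def normalize_address(addr)
  words = addr.downcase.gsub(/[^a-z0-9\s]/, ' ').split
  words = words.reject { |w| DIRECTIONALS.include?(w) }
  words.map { |w| ABBREVIATIONS.fetch(w, w) }.join(' ')
end

# Compares only the first few normalized words (house number and street).
def same_street?(candidate, formatted, word_count = 3)
  normalize_address(candidate).split.first(word_count) ==
    normalize_address(formatted).split.first(word_count)
end
```

With this, `same_street?(addr, formatted)` could stand in for the `[0..8]` comparison. Dropping single-letter directionals is crude and would also remove a street that is literally named "E" or "N".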

What I know so far: the first part works pretty well, but gives false positives as well as false negatives (in certain cases, it misses an address entirely). The second part helps weed out false positives and returns a better-formatted address (addresses in legal documents are not always well formatted).

The results capture about 85% of all the addresses in the document, which is acceptable for my project. I am sure I can bring this up with some fine-tuning, so regex masters, please feel free to chime in.

pcasa