I need to parse through some legal documents to find addresses inside them. Below is an example
test = "9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005 nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse 124 some ave 12 st, some city, NY, 10005cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed125 some ave 12 st, some city, NY, 10005 do eiusmod tempor incididunt ut labore 126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
tmp = test.scan(/(\d{3,6})(.*?)(\d{5})/)
tmp.each do |t|
puts t.join()
end
Normally, the addresses would start with a number and end with a zip code, but in these documents that is not always the case.
Problem is that I miss some and get some unwanted results like:
9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum 11111
What I would like is an array of the following 4 items:
123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK
As for the last item, I am pretty sure that all addresses formatted like this would end up with Either "New York" or "NY".
I think my target pattern is:
/(ANY DIGITS BETWEEN 3 AND 6)(AT LEAST 3 WORDS BUT NOT MORE THAN 10)((TRY FIRST ZIPCODE)|(IF NO ZIP CODE THEN TRY "NEW YORK" OR "NY"))/i
Any help would be greatly appreciated.