Parsing addresses is very tricky, and it's very easy to either write an overly simplistic regex that doesn't catch all the many, many special cases, or to fall down the rabbit hole of trying to catch all those special cases.
Fortunately, there are already two well-developed modules for this: Geocoder and StreetAddress. I personally worked on improving StreetAddress.
StreetAddress just parses addresses as best it can.
2.3.3 :001 > address = "6761 SW 19 St\\\nPark City, PA 19020"
=> "6761 SW 19 St\\\nPark City, PA 19020"
2.3.3 :002 > require 'street_address'
=> true
2.3.3 :005 > StreetAddress::US.parse(address)
=> #<StreetAddress::US::Address:0x007fcc62a88ca8 @number="6761", @street="19 St\\", @street_type="Park", @unit=nil, @unit_prefix=nil, @suffix=nil, @prefix="SW", @city="City", @state="PA", @postal_code="19020", @postal_code_ext=nil>
Note that it kept the backslash as part of the street name. A backslash in an address is quite abnormal. You can correct for this by overriding StreetAddress::US.parse so that it first strips trailing backslashes.
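A lighter-weight alternative to overriding the parser is to clean the string before handing it over. The sketch below is only one way to do that: the helper name strip_line_backslashes is made up for illustration, and the actual StreetAddress call is left as a comment because it assumes the gem is installed.

```ruby
# Hypothetical helper (not part of the StreetAddress gem): remove any
# backslashes that sit right before a line break or at the end of the string.
def strip_line_backslashes(address)
  address.gsub(/\\+(?=\n|\z)/, "")
end

raw     = "6761 SW 19 St\\\nPark City, PA 19020"
cleaned = strip_line_backslashes(raw)

puts cleaned

# With the gem loaded, the cleaned string would then be parsed as usual:
#   StreetAddress::US.parse(cleaned)
```

Backslashes elsewhere in the string are left alone, so this only targets the line-ending case shown above.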
Geocoder takes a different approach: it makes a fuzzy match against US Census data. It's a bit more difficult to set up, but it can do a better job parsing real street addresses.
Use one of them; don't write your own. I'll go over the problems in your code only as an exercise.
There are multiple problems, and any one of them will cause the match to fail. This can't be fixed by just throwing more backslashes around until it happens to work.
The first is in the address itself.
address = "6761 SW 19 St\\nPark City, PA 19020"
                        ^
\\n is a literal backslash followed by the letter n.
> address = "6761 SW 19 St\\nPark City, PA 19020"
=> "6761 SW 19 St\\nPark City, PA 19020"
> puts address
6761 SW 19 St\nPark City, PA 19020
I expect you meant \\\n instead, which is a literal backslash followed by a newline.
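To see the difference between the two escapes directly, here is a small check you can paste into irb. The variable names are just for illustration.

```ruby
two   = "St\\nPark"   # \\ then n:  a backslash followed by the letter n
three = "St\\\nPark"  # \\ then \n: a backslash followed by a real newline

# Both strings are the same length; only the fourth character differs.
puts two.length    # 8
puts three.length  # 8

p two[2, 2]    # "\\n"   (backslash, letter n)
p three[2, 2]  # "\\\n"  (backslash, newline)

puts two.include?("\n")    # false
puts three.include?("\n")  # true
```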
Then your regex has multiple problems. First, again, too many backslashes.
/(.+)\\\\n(\w+),\s(\w{2})\s(\d+)/
     ^^^^^
That is two literal backslashes followed by the letter n. You need \\\n instead.
The next problem is trying to match "Park City," with \w.
/(.+)\\\n(\w+),\s(\w{2})\s(\d+)/
         ^^^^^^
\w matches only letters, digits, and underscores, not spaces. You'd need [\w\s]+ instead.
Now that "works" for that particular address, but it's pretty brittle and will probably fail on many others.
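As a rough illustration of both points, the sketch below runs the corrected regex against the original address and then against two small variations. The variant addresses are made up for the example.

```ruby
regex = /(.+)\\\n([\w\s]+),\s(\w{2})\s(\d+)/

# The address from the question: backslash, newline, two-word city.
original = "6761 SW 19 St\\\nPark City, PA 19020"

# Made-up variations that a real data set could easily contain.
no_backslash    = "6761 SW 19 St\nPark City, PA 19020"
hyphenated_city = "6761 SW 19 St\\\nWilkes-Barre, PA 18701"

[original, no_backslash, hyphenated_city].each do |address|
  result = (regex =~ address) ? "matched" : "no match"
  puts "#{address.inspect}: #{result}"
end
```

Only the first address matches. The second has no backslash for \\ to consume, and the hyphen in the third is not covered by [\w\s].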
But using address =~ regex with $1 and such is not the best way to do matches in Ruby. Instead, use regex.match(address), which returns a MatchData object. You can then index it like an array: match[0] is everything that matched, match[1] is $1 (i.e. the first capture), and so on.
2.3.3 :034 > match[0]
=> "6761 SW 19 St\\\nPark City, PA 19020"
2.3.3 :035 > match[1]
=> "6761 SW 19 St"
2.3.3 :036 > match[2]
=> "Park City"
2.3.3 :037 > match[3]
=> "PA"
2.3.3 :038 > match[4]
=> "19020"
This avoids relying on special variables like $1 that might be overwritten by other regexes, and it allows you to pass the MatchData object around as a single unit.
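A minimal sketch of what that looks like in practice, assuming the corrected regex from above. The constant ADDRESS_RE and the helper describe are names invented for this example.

```ruby
ADDRESS_RE = /(.+)\\\n([\w\s]+),\s(\w{2})\s(\d+)/

# Hypothetical helper that receives the whole MatchData object.
def describe(match)
  street, city, state, zip = match.captures
  "#{street} | #{city} | #{state} | #{zip}"
end

address = "6761 SW 19 St\\\nPark City, PA 19020"
match   = ADDRESS_RE.match(address)

# Regexp#match returns nil when nothing matched, so check before using it.
puts describe(match) if match

# A later, unrelated match overwrites $1, but the MatchData object is unaffected.
"other" =~ /(o)/
puts $1        # o
puts match[1]  # 6761 SW 19 St
```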