
My regex is not returning a match even though the pattern appears to match the string:

regex = /(.+)\\\\n(\w+),\s(\w{2})\s(\d+)/
address = "6761 SW 19 St\\nPark City, PA 19020"
address =~ regex
 => nil 

I am expecting a result of 0 (the index where the match starts) so that I can use $1, $2, $3 to extract the data I want.

The only thing I can imagine being wrong here is the escape sequences. Am I right to escape the way I did above?

Anthony
Daniel Viglione

2 Answers

3

Parsing addresses is very tricky, and it's very easy to either write an overly simplistic regex that doesn't catch all the many, many special cases, or to fall down the rabbit hole of trying to catch all those special cases.

Fortunately, there are already two very well-developed modules for this: Geocoder and StreetAddress. I personally worked on improving StreetAddress.

StreetAddress just parses addresses as best it can.

2.3.3 :001 > address = "6761 SW 19 St\\\nPark City, PA 19020"
 => "6761 SW 19 St\\\nPark City, PA 19020" 
2.3.3 :002 > require 'street_address'
 => true 
2.3.3 :005 > StreetAddress::US.parse(address)
 => #<StreetAddress::US::Address:0x007fcc62a88ca8 @number="6761", @street="19 St\\", @street_type="Park", @unit=nil, @unit_prefix=nil, @suffix=nil, @prefix="SW", @city="City", @state="PA", @postal_code="19020", @postal_code_ext=nil> 

Note that it kept the backslash as part of the street name. A backslash in an address is quite abnormal. You can correct for this with an override to StreetAddress::US.parse which first strips trailing backslashes.
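As a rough sketch of that clean-up step, here is a small helper rather than an actual override of StreetAddress::US.parse. The name strip_trailing_backslashes is made up for illustration; it only prepares the string before it is handed to the parser.

```ruby
# Hypothetical helper: drop any backslashes that sit at the end of a line.
# In Ruby, $ matches at the end of every line, not only the end of the string.
def strip_trailing_backslashes(text)
  text.gsub(/\\+$/, "")
end

address = "6761 SW 19 St\\\nPark City, PA 19020"
cleaned = strip_trailing_backslashes(address)
p cleaned  # => "6761 SW 19 St\nPark City, PA 19020"

# With the street_address gem loaded, the cleaned string could then be parsed:
# StreetAddress::US.parse(cleaned)
```

Keeping the clean-up outside the gem avoids monkeypatching, at the cost of having to remember to call it first.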

Geocoder takes a different approach to make a fuzzy match against US Census data. It's a bit more difficult to set up, but it can do a better job parsing real street addresses.

Use one of them; don't write your own. I'll go over the problems in your code only as an exercise.


There are multiple problems, and any one of them will cause the match to fail. This can't be fixed by just throwing more backslashes around until it happens to work.

First is in the address itself.

address = "6761 SW 19 St\\nPark City, PA 19020"
                        ^

\\n is a literal backslash followed by the letter n.

> address = "6761 SW 19 St\\nPark City, PA 19020"
 => "6761 SW 19 St\\nPark City, PA 19020" 
> puts address
6761 SW 19 St\nPark City, PA 19020

I expect you meant \\\n which is a literal backslash followed by a newline.
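A quick way to see the difference is to look at the individual characters. This is only an illustration using shortened strings; the variable names are made up.

```ruby
# "\\n" in a double-quoted string: a backslash, then the letter n.
literal_n = "St\\nPark"
# "\\\n" in a double-quoted string: a backslash, then a real newline.
with_newline = "St\\\nPark"

p literal_n.chars[2, 2]        # => ["\\", "n"]
p with_newline.chars[2, 2]     # => ["\\", "\n"]
p literal_n.include?("\n")     # => false
p with_newline.include?("\n")  # => true
```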

Then your regex has multiple problems. First, again, too many backslashes.

/(.+)\\\\n(\w+),\s(\w{2})\s(\d+)/
     ^^^^^

That is two literal backslashes followed by the letter n. You need \\\n.

The next problem is trying to match "Park City," with \w.

/(.+)\\\n(\w+),\s(\w{2})\s(\d+)/
         ^^^^^^

\w is letters and numbers and underscore only, no spaces. You'd need [\w\s]+ instead.

Now that "works" for that particular address, but it's pretty brittle and will probably fail on many others.
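Putting both fixes together, the following sketch assumes the address really is meant to contain a backslash followed by a newline, as discussed above.

```ruby
# Assumed intent: the street line ends with a backslash followed by a real newline.
address = "6761 SW 19 St\\\nPark City, PA 19020"

# \\ matches one literal backslash, \n matches the newline,
# and [\w\s]+ lets the city contain spaces.
regex = /(.+)\\\n([\w\s]+),\s(\w{2})\s(\d+)/

p address =~ regex  # => 0
p [$1, $2, $3, $4]  # => ["6761 SW 19 St", "Park City", "PA", "19020"]
```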


But using address =~ regex with $1 and such is not the best way to do matches in Ruby. Instead, use regex.match(address), which returns a MatchData object (or nil if nothing matched). You can then index it like an array: match[0] is everything that matched, match[1] is $1 (i.e. the first capture), and so on.

2.3.3 :034 > match[0]
 => "6761 SW 19 St\\\nPark City, PA 19020" 
2.3.3 :035 > match[1]
 => "6761 SW 19 St" 
2.3.3 :036 > match[2]
 => "Park City" 
2.3.3 :037 > match[3]
 => "PA" 
2.3.3 :038 > match[4]
 => "19020" 

This avoids using variables that might be overwritten by other regexes and allows you to pass the MatchData object around as a single unit.
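For completeness, here is a sketch of how such a match object might be obtained and checked. The named groups (street, city, state, zip) are an addition for illustration and are not part of the original regex; numeric indexes keep working alongside them.

```ruby
address = "6761 SW 19 St\\\nPark City, PA 19020"
regex = /(?<street>.+)\\\n(?<city>[\w\s]+),\s(?<state>\w{2})\s(?<zip>\d+)/

match = regex.match(address)  # MatchData on success, nil on failure

if match
  p match[1]        # => "6761 SW 19 St"
  p match[:city]    # => "Park City"
  p match.captures  # => ["6761 SW 19 St", "Park City", "PA", "19020"]
else
  puts "no match"
end
```

Checking for nil before indexing matters, because calling [] on nil raises NoMethodError.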

Schwern
  • This is the regex that ultimately worked: /(.+)\\n([\w\s]+),\s(\w{2})\s(\d+)/ – Daniel Viglione Mar 15 '17 at 23:32
  • @Donato That "worked" only because the address is incorrect. Again, `\\n` is a literal backslash followed by an n which is nonsense. It should be `\\\n` which is a literal backslash followed by a newline. You changed the regex to match that mistake in the address rather than fixing the mistake. ***Print the address*** and you'll see. – Schwern Mar 15 '17 at 23:51
  • https://stackoverflow.com/questions/648156/backslashes-in-single-quoted-strings-vs-double-quoted-strings – Daniel Viglione Nov 28 '17 at 01:09
0

Another quick alternate regex:

regex = /(.+)\\n([^,]+),\s(\w{2})\s(\d+)/

Here we use a negated character class, [^,]+ (anything except a comma), to capture the suburb.
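A small comparison, assuming the address contains a backslash followed by the letter n as in the question. The second address is made up purely to show a city name with a hyphen.

```ruby
regex      = /(.+)\\n([^,]+),\s(\w{2})\s(\d+)/
word_space = /(.+)\\n([\w\s]+),\s(\w{2})\s(\d+)/  # the [\w\s]+ variant, for comparison

address = "6761 SW 19 St\\nPark City, PA 19020"
p regex.match(address).captures
# => ["6761 SW 19 St", "Park City", "PA", "19020"]

# Made-up address: [^,]+ accepts the hyphen, [\w\s]+ does not.
hyphenated = "1 Main St\\nWilkes-Barre, PA 18701"
p regex.match(hyphenated)[2]    # => "Wilkes-Barre"
p word_space.match(hyphenated)  # => nil
```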

grail