
I have street address strings in different formats. I tried the approach from an older post, but it did not help much. My strings come in the following formats:

format 1:

string_1 = ', landlord and tenant entered into a an agreement with respect to approximately 5,569 square feet of space in the building known as "the company" located at 788 e.7th street, st. louis, missouri 55605 ( capitalized terms used herein and not otherwise defined herein shall have the respective meanings given to them in the agreement); whereas, the term of the agreement expires on may 30, 2015;'

desired output:

788 e.7th street, st. louis, missouri 55605

format 2:

string_2 = 'first floor 824 6th avenue, chicago, il where the office is located'

desired output:

824 6th avenue, chicago, il

format 3:

string_3 = 'whose address is 90 south seventh street, suite 5400, dubuque, iowa, 55402.'

desired output:

90 south seventh street, suite 5400, dubuque, iowa, 55402

So far, I have tried this for string_1:

address_match_1 = re.findall(r'((\d*)\s+(\d{1,2})(th|nd|rd).*\s([a-z]))', string_1)

I get an empty list.

For the second string I tried the same pattern and also got an empty list:

address_match_2 = re.findall(r'((\d*)\s+(\d{1,2})(th|nd|rd).*\s([a-z]))', string_2)

How can I match these addresses using re? They are all in different formats. In particular, how can I include the suite in string_3? Any help would be appreciated.

user9431057

1 Answer


Solution

This regex matches all addresses in the question:

(?i)\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b)    

To use this in practice, you would need to add every state name and abbreviation, and use a more robust ZIP code pattern (these are easy to find online). Note that this approach only works for US addresses.
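One way to avoid typing the alternation by hand is to build it from a list. The sketch below is only an illustration: the `STATES` list is a hypothetical, incomplete sample, `build_address_pattern` is a made-up helper name, the optional ZIP+4 suffix is an assumption about what "a better match for the zip code" might mean, and the `\b` in front of the state names is an addition that is not in the regex above.

```python
import re

# Hypothetical, deliberately incomplete sample; extend it to all states.
STATES = ["missouri", "mo", "illinois", "il", "iowa", "ia"]

def build_address_pattern(states):
    # Longest names first, so "illinois" is tried before "il".
    names = sorted(states, key=len, reverse=True)
    state = "|".join(re.escape(name) for name in names)
    # ZIP code: five digits, optionally followed by "-" and four more (ZIP+4).
    zip_code = r"\d{5}(?:-\d{4})?"
    return re.compile(
        r"\d+ (?:(?! \d+ ).)*\b(?:" + state + r")(?:,? " + zip_code + r"|\b)",
        re.IGNORECASE,
    )

ADDRESS_RE = build_address_pattern(STATES)

string_3 = 'whose address is 90 south seventh street, suite 5400, dubuque, iowa, 55402.'
match = ADDRESS_RE.search(string_3)
print(match.group(0) if match else None)
```

Short abbreviations such as "il" or "ia" are easy to find inside ordinary words, which is why the sketch requires a word boundary on both sides of the state.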

Here is the output for each of the given strings:

>>> import re
>>> pattern = r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))"
>>> m = re.findall(pattern, string_1)
>>> print(m)
[('788 e.7th street, st. louis, missouri 55605', ' ', 'missouri', ' 55605')]
>>> m = re.findall(pattern, string_2)
>>> print(m)
[('824 6th avenue, chicago, il', ' ', 'il', '')]
>>> m = re.findall(pattern, string_3)
>>> print(m)
[('90 south seventh street, suite 5400, dubuque, iowa, 55402', ' ', 'iowa', ', 55402')]

The first value of each tuple has the correct address. However, this may not be exactly what you need (see Weakness below).
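Since only the first element of each tuple is of interest, one option is to make the inner groups non-capturing, so that `re.findall` returns the full match as a plain string. This is a minimal sketch of that variation, not part of the original solution; the name `ADDRESS_RE` is just for illustration.

```python
import re

# Same regex as in the solution, but with non-capturing groups (?:...)
# so that re.findall returns each whole match as a string.
ADDRESS_RE = re.compile(
    r"\d+ (?:(?! \d+ ).)*(?:missouri|il|iowa)(?:, \d{5}| \d{5}|\b)",
    re.IGNORECASE,
)

string_2 = 'first floor 824 6th avenue, chicago, il where the office is located'
string_3 = 'whose address is 90 south seventh street, suite 5400, dubuque, iowa, 55402.'

for text in (string_2, string_3):
    print(ADDRESS_RE.findall(text))
```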

Detail

Assumptions:

  • Address starts with a number followed by a space
  • Address ends with a state, or its abbreviation, optionally followed by a 5 digit zip code
  • The rest of the address is in between the two parts above. This part doesn't contain any numbers surrounded by spaces (i.e. with no " \d+ ").

regex string:

r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))"

r"" makes the string a raw string, so backslashes in the regex do not need to be escaped

(?i) makes the regex case-insensitive. In Python 3.11 and later, this inline flag must appear at the very start of the pattern.

\d+ followed by a space: the address starts with a number followed by a space

(missouri|il|iowa)(, \d{5}| \d{5}|\b)) the address ends with a state, optionally followed by a zip code. The \b is a word boundary; it matches without consuming anything when the state name ends there, which is what makes the zip code optional.

((?! \d+ ).)* any run of characters that does not contain a number surrounded by spaces. Refer to this article for an explanation on how this works.
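To see the `((?! \d+ ).)*` part in isolation, here is a small sketch (the sample strings are made up for illustration). The negative lookahead is checked before each character is consumed, so matching stops right before the first " number " sequence.

```python
import re

# Consume one character at a time, but only while the text at the
# current position does not start with " <digits> ".
tempered = re.compile(r"(?:(?! \d+ ).)*")

# " 5400," is followed by a comma, not a space, so nothing stops the match.
print(repr(tempered.match("suite 5400, dubuque").group(0)))

# " 824 " is a number surrounded by spaces, so the match stops before it.
print(repr(tempered.match("floor 824 6th avenue").group(0)))
```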

Weakness

Regular expressions are used to match patterns, but the addresses presented don't have much of a pattern that distinguishes them from the surrounding text. Here is the pattern that I identified and based the solution on:

  • Address starts with a number followed by a space
  • Address ends with a state, or its abbreviation, optionally followed by a 5 digit zip code
  • The rest of the address is in between the two parts above. This part doesn't contain any numbers surrounded by spaces (i.e. with no " \d+ ").

Any address that violates these assumptions won't be matched correctly. For example:

  • Addresses whose initial number has a letter suffix, such as 102A or 3B.
  • Addresses with a standalone number between the initial number and the state, such as one containing ' 7 street' instead of ' 7th street'.

Some of these weaknesses may be fixed with simple changes to the regex, but some may be more difficult to fix.
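As an example of a simple change, the first weakness can be addressed by allowing one optional letter after the initial number. This is only a sketch under that assumption; the sample string and the names `original` and `relaxed` are made up for illustration, and the ' 7 street' case is not handled.

```python
import re

# Regex from the solution, unchanged.
original = re.compile(
    r"\d+ (?:(?! \d+ ).)*(?:missouri|il|iowa)(?:, \d{5}| \d{5}|\b)",
    re.IGNORECASE,
)

# Variation: allow a single optional letter after the house number ("102A").
relaxed = re.compile(
    r"\d+[a-z]? (?:(?! \d+ ).)*(?:missouri|il|iowa)(?:, \d{5}| \d{5}|\b)",
    re.IGNORECASE,
)

sample = 'located at 102A main street, chicago, il 60601'

print(original.search(sample))          # no match: "102" is not followed by a space
print(relaxed.search(sample).group(0))
```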

Luis Guzman
  • Thank you for the effort, I am checking it now! I have more words than "which" and "where" in each string; I only added those two or three words as an example. If you see my edit, the first string is what it actually looks like. – user9431057 Mar 02 '18 at 02:16
  • I checked it and I think you pointed me in the right direction. Thanks! If I have more questions, I will make sure to post them. (Since I don't have 15 points I cannot upvote :( ) – user9431057 Mar 03 '18 at 00:42
  • My pleasure. Don't worry; you'll get to the 15 points soon enough. :) – Luis Guzman Mar 03 '18 at 01:35