
I have address data that I'm trying to standardize.

This includes cleaning elements like Rd to Road and Dr to Drive.

However, I am completely stumped on how to differentiate between Street and Saint, since both are abbreviated to St.

Has anybody done something like this before? Any ideas how to get around it?

My code so far (adapted from here). Watch `st mary's road` in the final row:

import re
import pandas as pd

# set up a df with fake addresses:
adds = pd.DataFrame({'address':['1 main st','2 garden dr.','4 foo apts','7 orchard gdns','st mary\'s road']})
print(adds)

          address
0       1 main st
1    2 garden dr.
2      4 foo apts
3  7 orchard gdns
4  st mary's road

# set up a dictionary of names to change
def suffixDict():

    return {'dr': 'drive',
            'rd': 'road',
            'st':'Street', # or 'st':'Saint' ??
            'apts':'apartments',
            'gdns':'gardens'}

# function to fix suffixes
def normalizeStreetSuffixes(inputValue):
    abbv = suffixDict()  # get dict
    words = inputValue.split()  # split address line
    for i, word in enumerate(words):
        w = word.lower()  # lowercase
        w = re.sub(r"[^\w'\s]+", '', w)  # strip punctuation, keep apostrophes
        rep = abbv.get(w, word)  # expand if abbreviated, else keep the word
        words[i] = rep[0].upper() + rep[1:]  # capitalise the first letter
    return ' '.join(words)  # return cleaned address line


# apply function to address data
adds.address.apply(normalizeStreetSuffixes)

0         1 Main Street
1        2 Garden Drive
2      4 Foo Apartments
3     7 Orchard Gardens
4    Street Mary's Road

You can see that `st mary's road` has been changed to Street Mary's Road instead of Saint Mary's Road.

SCool
  • Wouldn't the differentiation be whether it occurs before or after the street? – tgikal Aug 09 '19 at 15:00
  • Anecdotally, "Saint" usually appears before a noun, whereas "Street" usually appears after one. – Xophmeister Aug 09 '19 at 15:01
  • Well, if it's at the end it has to be Street, and if it's at the beginning (except for numbers) it has to be Saint. Analyze your own data to see if you can find exceptions. – Alex Hall Aug 09 '19 at 15:01
  • Going off Xophmeister's comment, I would suggest making a nested `if` statement to look at the first two characters and the last two characters as well, and do the replacement like that. – simplycoding Aug 09 '19 at 15:06
  • To those saying Saint usually appears before a noun: we have addresses here where Street can also appear before a noun, such as `Garden Street Apartments`, `Main Street Lower` or `North Street Cottages`. I also don't think I can tell where exactly the `St` falls in the string, because I `.split()` the address, process the words separately with the dictionary, then `.join` them at the end. – SCool Aug 09 '19 at 15:11
  • Not sure if there is a way to do this, but I think that https://stackoverflow.com/questions/20290870/improving-the-extraction-of-human-names-with-nltk would be a "good" approach, if you can accept some mistakes. – Alexander Santos Aug 09 '19 at 15:15
  • But if there's an instance of `St` at the beginning, that should represent `Saint`, right? And would it be safe to assume that every other instance of `St` in the address is `Street`? – simplycoding Aug 09 '19 at 16:39
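
The positional idea raised in the comments can be sketched roughly as follows. This is only a heuristic under a stated assumption, not a verified solution: `st` is read as Saint when it is the first word that is not a house number and another word follows it, and as Street everywhere else. The names `SUFFIXES`, `expand_st` and `normalize_address` are hypothetical and not part of the original code. Because `enumerate` already gives the index of each word after `.split()`, the position of `st` is available inside the loop.

```python
import re

# Same abbreviations as in the question, minus the ambiguous 'st'.
SUFFIXES = {
    'dr': 'drive',
    'rd': 'road',
    'apts': 'apartments',
    'gdns': 'gardens',
}


def expand_st(position, tokens):
    """Guess whether 'st' at tokens[position] means Saint or Street.

    Assumption: 'st' is Saint only when it is the first word that is not a
    house number and another word follows it; otherwise it is Street.
    """
    # Index of the first token that does not start with a digit.
    first_word = next(
        (i for i, t in enumerate(tokens) if not t[:1].isdigit()), None
    )
    has_following_word = position < len(tokens) - 1
    if position == first_word and has_following_word:
        return 'saint'
    return 'street'


def normalize_address(address):
    words = address.split()
    # Lowercase and strip punctuation except apostrophes, as in the question.
    cleaned = [re.sub(r"[^\w']", '', w.lower()) for w in words]
    result = []
    for i, (word, key) in enumerate(zip(words, cleaned)):
        if key == 'st':
            rep = expand_st(i, cleaned)  # position-aware choice
        else:
            rep = SUFFIXES.get(key, word)  # plain dictionary lookup
        result.append(rep[0].upper() + rep[1:])  # capitalise first letter
    return ' '.join(result)
```

It could be applied to the frame from the question with `adds.address.apply(normalize_address)`. Addresses such as `Garden St Apartments` come out as Street because `st` is not the first word, but a name like `Upper St Mary's Rd` would be expanded incorrectly, so it is worth checking the rule against real data for exceptions.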
