I have address data that I'm trying to standardize.
This includes expanding abbreviations such as Rd to Road and Dr to Drive.
However, I am completely stumped on how to differentiate between Street and Saint, since both are abbreviated to St.
Has anybody done something like this before? Any ideas how to get around it?
My code so far (adapted from here) is below. Watch st mary's road in the final row:
import re
import pandas as pd
# set up a df with fake addresses:
adds = pd.DataFrame({'address':['1 main st','2 garden dr.','4 foo apts','7 orchard gdns','st mary\'s road']})
print(adds)
address
0 1 main st
1 2 garden dr.
2 4 foo apts
3 7 orchard gdns
4 st mary's road
# set up a dictionary of names to change
def suffixDict():
    return {'dr': 'drive',
            'rd': 'road',
            'st': 'Street',  # or 'st': 'Saint' ??
            'apts': 'apartments',
            'gdns': 'gardens'}
# function to fix suffixes
def normalizeStreetSuffixes(inputValue):
    abbv = suffixDict()  # get dict
    words = inputValue.split()  # split address line
    for i, word in enumerate(words):
        w = word.lower()  # lowercase
        w = re.sub(r"[^\w'\s]+", '', w)  # remove special characters (keeps apostrophes)
        rep = abbv.get(w, word)  # check dict, fall back to the original word
        words[i] = rep[0].upper() + rep[1:]  # proper case
    return ' '.join(words)  # return cleaned address line
# apply function to address data
adds.address.apply(normalizeStreetSuffixes)
0 1 Main Street
1 2 Garden Drive
2 4 Foo Apartments
3 7 Orchard Gardens
4 Street Mary's Road
You can see that st mary's road, which should become Saint Mary's Road, has been changed to Street Mary's Road.
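One idea I have been considering, as a rough sketch rather than a confirmed solution, is to decide based on position: treat st as Street when it is the last token or is followed by something that looks like a unit or number, and as Saint otherwise. The names `SUFFIXES`, `AFTER_STREET`, `expand_st` and `normalize_address` below are hypothetical and only for illustration, and the lookup tables are assumptions that would need extending for real data.

```python
import re

# Hypothetical lookup tables for illustration; extend them for real data.
SUFFIXES = {"dr": "drive", "rd": "road", "apts": "apartments", "gdns": "gardens"}
# Tokens that, when they follow "st", suggest the street name has already ended.
AFTER_STREET = {"apt", "apts", "unit", "ste", "n", "s", "e", "w"}


def expand_st(tokens, i):
    """Guess whether the 'st' at position i means 'saint' or 'street'."""
    if i == len(tokens) - 1:
        return "street"  # nothing follows, so it is most likely a suffix
    nxt = tokens[i + 1]
    if nxt in AFTER_STREET or nxt[0].isdigit() or nxt[0] == "#":
        return "street"  # followed by a unit, direction or number
    return "saint"  # followed by an ordinary word, assumed to be a name


def normalize_address(address):
    # Lowercase, then strip punctuation other than apostrophes and '#'.
    tokens = [re.sub(r"[^\w'#]+", "", t) for t in address.lower().split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    out = []
    for i, tok in enumerate(tokens):
        if tok == "st":
            tok = expand_st(tokens, i)
        else:
            tok = SUFFIXES.get(tok, tok)
        out.append(tok[0].upper() + tok[1:])  # proper case
    return " ".join(out)


print(normalize_address("1 main st"))
print(normalize_address("st mary's road"))
print(normalize_address("12 st john's st"))
```

This is only a heuristic and it will misclassify some addresses: for example, "main st west" would come out as "Main Saint West" unless "west" is added to `AFTER_STREET`. With pandas it could be applied the same way as above, e.g. `adds.address.apply(normalize_address)`.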