3

I have a list of addresses that I would like to put into a dataframe where each row is a new address and the columns are the units of the address (title, street, city).

However, the way the list is structured, some addresses are longer than others. For example:

address = ['123 Some Street, City','45 Another Place, PO Box 123, City']

I have a pandas dataframe with the following columns:

Index     Court       Address                              Zipcode   Phone                           
0         Court 1     123 Court Dr, Springfield            12345     11111
1         Court 2     45 Court Pl, PO Box 45, Pawnee       54321     11111
2         Court 3     1725 Slough Ave, Scranton            18503     11111
3         Court 4     101 Court Ter, Unit 321, Eagleton    54322     11111

I would like to split the Address column into up to three columns depending on how many comma separators there are in the address, with NaN filling in where values will be missing.

For example, I hope the data will look like this:

Index     Court       Address          Address2     City           Zip  Phone                                          
0         Court 1     123 Court Dr     NaN          Springfield    ...   ...           
1         Court 2     45 Court Pl      PO Box 45    Pawnee         ...   ...
2         Court 3     1725 Slough Ave  NaN          Scranton       ...   ...
3         Court 4     101 Court Ter    Unit 321     Eagleton       ...   ...

I have plowed through and tried a ton of different solutions on StackOverflow to no avail. The closest I have gotten is with this code:

df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)

But that returns a dataframe that adds the following three columns to the end structured as such:

...  0              1             2
... 123 Court Dr   Springfield   None
... 45 Court Pl    PO Box 45     Pawnee

This is close, but as you can see, for the shorter entries, the city lines up with the second address line for the longer entries.

Ideally, column 2 should populate every single row with a city, and column 1 should alternate between "None" and the second address line if applicable.

I hope this makes sense -- this is a tough one to put into words. Thanks!

n0ro
  • 477
  • 4
  • 11

3 Answers3

0

Addresses, especially those produced by human input can be tricky. But, if your addresses only fit those two formats this will work:

Note: If there is an additional format you have to account for, this will print the culprit.

def split_address(df):
    for index,row in df.iterrows():
        full_address = df['address']
        if full_address.count(',') == 3:
            split = full_address.split(',')
            row['address_1'] = split[0]
            row['address_2'] = split[1]
            row['city'] = split[2]
        else if full_address.count(',') == 2:
            split = full_address.split(',')
            row['address_1'] = split[0]
            row['city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))

Essentially the two things that should help you are the string.count() function which will tell you the number of commas in a string, and the string.split() which you already found that will split the input into an array. You can reference the portions of this array to allocate the pieces to the correct column.

foobarbaz
  • 508
  • 1
  • 10
  • 27
0

You could do something like this:

df['Address1'] = df['Address'].str.split(',').str[0]
df['Address2'] = df['Address'].str.extract(',(.*),')
df['City'] = df['Address'].str.split(',').str[-1]
alexshchep
  • 268
  • 1
  • 15
0

You can look into creating a function using the package usaddress. It has been very helpful for me when I need to split address into parts:

import usaddress

df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])

Then create functions for how you want to split the data:

def Address1(x):
    try:
        data = usaddress.tag(x)
        if 'AddressNumber' in data[0].keys() and 'StreetName' in data[0].keys() and 'StreetNamePostType' in data[0].keys():
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except:
        pass

def Address2(x):
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0].keys() and 'OccupancyIdentifier' in data[0].keys():
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0].keys() and 'USPSBoxID' in data[0].keys():
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except:
        pass

def PlaceName(x):
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0].keys():
            return data[0]['PlaceName']
    except:
        pass

df['Address1'] = df.apply(lambda x: Address1(x['Address']), axis=1)
df['Address2'] = df.apply(lambda x: Address2(x['Address']), axis=1)
df['City'] = df.apply(lambda x: PlaceName(x['Address']), axis=1)

out:

                               Address      Address1    Address2     City
0   123 Main St. Suite 100 Chicago, IL  123 Main St.   Suite 100  Chicago
1  123 Main St. PO Box 100 Chicago, IL  123 Main St.  PO Box 100  Chicago
It_is_Chris
  • 13,504
  • 2
  • 23
  • 41