1

I have a table that contains 2 columns.

     Column 1 | Column 2     
----------------------------
unique_number | '123 Main St. Suite 100 Chicago, IL'

I've been exploring address parsing using https://parserator.datamade.us/api-docs and ideally would like to parse the address, and put the results into new columns.

import usaddress
addr='123 Main St. Suite 100 Chicago, IL'

Two options for returning parsed results, and I plan on using whichever is easier to add to a dataframe:

  1. usaddress.parse(addr) The parse method will split your address string into components, and label each component. (returns list)
  2. usaddress.tag(addr) The tag method will try to be a little smarter it will merge consecutive components, strip commas, & return an address type (returns ordered list)

There are 26 different tags available for an address using this parser.

However, Not all addresses will contain all of these tags.

I need to grab the full address for each row, parse it, map the parsed results to each matching column in that same row.

What the tag data looks like using from_records (index isn't exactly ideal)

What the parse data looks like using from_records

I can't quite figure out how to logic of row by row calculations and mapping results.

warvolin
  • 1,030
  • 6
  • 9

2 Answers2

1

First, create a column of json responses from the parsing service

df['json_response'] = df['address'].apply(usaddress.pars)

Next, combine all the jsons into a single json string

json_combined = json.dumps(list(df['json_response']))

Finally parse the combined json to a dataframe (after parsing the json string)

df_parsed = pd.io.json.json_normalize(json.loads(json_combined))

Now you should have a structured dataframe with all required columns which you can df.join with your original dataframe to produce a single unified dataset.

Just a note, depending on the structure of the returned json, you may need to pass further arguments to the `pandas.io.json.json_normalize function. The example on the linked page is a good starting point.

doktakay
  • 148
  • 6
  • Thank you for your response! Definitely in the right direction. I'm running into an issue on the last line `list object jas no attribute values` [view of the data](https://www.dropbox.com/s/v9lkcgxqw3beda3/Screenshot%202017-12-05%2012.00.46.png?dl=0) – warvolin Dec 05 '17 at 19:59
  • It's difficult to know exactly what's going on without looking at the actual outputs, but it appears that the function `usaddress.pars` isn't returning a JSON but rather a list. You'll need to convert the list pairs to a dict that has an key-value pair structure similar to a JSON. Check out this link [here](https://stackoverflow.com/questions/4576115/convert-a-list-to-a-dictionary-in-python). – doktakay Dec 05 '17 at 22:21
1

Super late in posting this solution, but wanted to in case anyone else ran into the same problem

Address csv file headings:

name, address

Imports:

import pandas as pd
import numpy as np
import json
import itertools
import usaddress

def address_func(address):
    try:
        return usaddress.tag(address)
    except:
        return [{'AddressConverstion':'Error'}]

# import file
file = pd.read_excel('addresses.xlsx')

# apply function
file['tag_response'] = file['Full Address'].apply(address_func)

# copy values to new column
file['tags'] = file.apply(lambda row: row['tag_response'][0], axis=1)

# dump json
tags_combined = json.dumps(list(file['tags']))

# create dataframe of parsed info
df_parsed = pd.io.json.json_normalize(json.loads(tags_combined))

# merge dataframes on index
merged = file.join(df_parsed)
warvolin
  • 1,030
  • 6
  • 9