I have a DataFrame of 533,668 active business registries from the County Assessor's office, loaded from an Excel spreadsheet. I want to break the addresses (currently all in one column) into AddressNumber, StreetName, StreetType, UnitNumber, City, State, etc. I have a library (usaddress) that can parse the column. I used:

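```python
import usaddress

def clean_address(row):
    try:
        # usaddress.tag returns a (tagged_address, address_type) tuple
        prep_address = usaddress.tag(row)
        address = prep_address[0]
    except usaddress.RepeatedLabelError as e:
        # Raised when the same label is assigned to more than one part
        print(e.parsed_string)
        print(e.original_string)
        address = 'Duplicate Address'
    except TypeError:
        # Non-string input, e.g. NaN from an empty cell
        address = 'Invalid Address'
    return address

# active_businesses is the DataFrame loaded from the spreadsheet
address_list = active_businesses['STREET ADDRESS'].apply(clean_address)
```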

The problem is that this gives me a Series of OrderedDicts (one per row), which I then need to parse to get a DataFrame. I tried a for loop, but it was incredibly slow. Does anyone have a better idea?

kpdebree

1 Answer

The apply method is basically a for loop under the hood. You may get better performance with np.vectorize, which works in much the same way and has been faster for me in the past. Note that NumPy's documentation describes np.vectorize as essentially a loop as well, so it is worth timing both on your own data. See the post "Performance of Pandas apply vs np.vectorize to create new column from existing columns".
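As a rough sketch of what swapping apply for np.vectorize could look like: the example below uses `tag_stub`, a made-up stand-in for `usaddress.tag` (so it runs without the library installed), and three invented sample addresses. The `otypes=[object]` argument is there because the function returns dicts rather than numbers.

```python
from collections import OrderedDict

import numpy as np
import pandas as pd


def tag_stub(text):
    """Hypothetical stand-in for usaddress.tag: returns (OrderedDict, address_type)."""
    number, _, street = text.partition(" ")
    return OrderedDict(AddressNumber=number, StreetName=street), "Street Address"


def clean_address(value):
    try:
        return tag_stub(value)[0]
    except (TypeError, AttributeError):
        # Non-string input such as None or NaN
        return "Invalid Address"


# Invented sample data standing in for active_businesses['STREET ADDRESS']
addresses = pd.Series(["123 Main St", "456 Oak Ave", None], name="STREET ADDRESS")

# otypes=[object] tells NumPy the outputs are arbitrary Python objects
vectorized_clean = np.vectorize(clean_address, otypes=[object])
parsed = pd.Series(vectorized_clean(addresses.to_numpy()), index=addresses.index)
```

With the real library you would call `usaddress.tag` inside `clean_address` and keep the original exception handling. Whether this beats apply depends on the data, so time both.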

Regarding the OrderedDicts, there is not much you can do other than parse them efficiently. This thread may help: "How to create a Pandas DataFrame from a list of OrderedDicts?"
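One way to avoid the row-by-row loop, sketched here under some assumptions: pandas can build a DataFrame directly from a list of dicts, using the union of keys as columns and filling missing ones with NaN. The sample records are invented, and the placeholder strings from `clean_address` are replaced with empty dicts so those rows come out as all-NaN.

```python
from collections import OrderedDict

import pandas as pd

# Invented sample of what the parsed column might contain
parsed = [
    OrderedDict(AddressNumber="123", StreetName="Main", StreetNamePostType="St"),
    OrderedDict(AddressNumber="456", StreetName="Oak", StreetNamePostType="Ave",
                OccupancyIdentifier="4"),
    "Invalid Address",  # placeholder string returned on a parse failure
]

# Replace non-dict placeholders with empty dicts so every row is a mapping
records = [p if isinstance(p, dict) else {} for p in parsed]

# One constructor call; columns are the union of all keys
address_df = pd.DataFrame(records)
```

If the result should sit next to the original columns, something like `active_businesses.join(address_df)` should work as long as both share the same index.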

guigomcha