1

I have a DataFrame that contains information, in particular addresses like the DataFrame below:

    col1   col2
0   1303   674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP
1   1205   154 Coller Crescent Runcorn,Cheshire WP6 4TY
2   1504   122 Uphill Road,Rayleigh, Essex SF6 9VT
3   1678   67 Lampoon Crescent,Billericay, Essex, CM52 0QY
4   1897   32 Dovelane,Benfleet, Essex, PT7 6WA
5   1654   46, The Clewter,Great Durham, Essex, CD7 9HE

These are all different formats, with some having commas and others not, and there are also examples of addresses from other countries. I was wondering how would I extract the addresses from here as I want to merge them with location data.

This could mean a merge on substring or just extracting the

I have tried:

df["postcodes"] = df["address"].str.extract(r'^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$')

to extract the postcodes but this doesn't seem to work coming up with the error of 9 arguments given where 1 should be taken.

I have also attempted:

rhs = (df1.address
          .apply(lambda x: df2[df2.Postcode.str.find(x).ge(0)]['location'])
          .bfill(axis=1)
          .iloc[:, 0])

(pd.concat([df1.app_nbr, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'app_nbr', 1: 'location'}))

from here: How to merge pandas on string contains? but it is taking a long time to run on my machine given there are 1.7million postcodes in the second dataframe to match on.

The expected output would be either:

    col1   col2                                                  col3
0   1303   674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP   TN5 4NP
1   1205   154 Coller Crescent Runcorn,Cheshire WP6 4TY       WP6 4TY
2   1504   122 Uphill Road,Rayleigh, Essex SF6 9VT            SF6 9VT
3   1678   67 Lampoon Crescent,Billericay, Essex, CM52 0QY   CM52 0QY
4   1897   32 Dovelane,Benfleet, Essex, PT7 6WA               PT7 6WA
5   1654   46, The Clewter,Great Durham, Essex, CD7 9HE       CD7 9HE

Or (matching with the second dataframe based on postcode):

    col1   col2                                              col3 (coords)
0   1303   674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP   50.00, 1.00
1   1205   154 Coller Crescent Runcorn,Cheshire WP6 4TY       51.23, 1.05
2   1504   122 Uphill Road,Rayleigh, Essex SF6 9VT            54.65, 1.07
3   1678   67 Lampoon Crescent,Billericay, Essex, CM52 0QY    51.23, 0.95
4   1897   32 Dovelane,Benfleet, Essex, PT7 6WA               54.6,  2.23
5   1654   46, The Clewter,Great Durham, Essex, CD7 9HE       49.25, 1.23

Any help would be appreciated or to be pointed in the right direction.

Thank you

*addresses have been altered so are not real but the format is the same

Philip09
  • 85
  • 9

3 Answers3

0

If you always need the 2 last values , convert string to list by using split , and take the last two values in the list .

Adress="Yellow Gardens,Tunbridge Wells, Kent TN5 4NP"

Adresslist=Adress.split()

Zip = Adresslist[len(Adresslist)-1]+" "+ Adresslist[len(Adresslist)]

Nir Elbaz
  • 556
  • 4
  • 19
  • Unfortunately, they are not all in the last 2 values with many irregular ones (sorry should have specified that) – Philip09 Jun 01 '20 at 09:09
0

I don't know how irregular your data is and what your tolerance is for fiddling, but having faced pretty messy address data, sometimes you need some lateral thinking. Consider using the google maps API, throw the addresses at it, and take back the cleaned data using all the smarts of Google. For 1.7m addresses, you'll have to pay a bit, the free daily quota is pretty small.

Tim Richardson
  • 6,608
  • 6
  • 44
  • 71
  • Hi Tim, I have thought about that but I don't have the means to use the google geocoding API to go through all of the addresses due to the cost after going over the daily quota. – Philip09 Jun 01 '20 at 09:11
0

Try to use postal: https://github.com/openvenues/pypostal

It's an open-source library for parsing addresses

In [1]: from postal.parser import parse_address

In [2]: parse_address("Coller Crescent Runcorn,Cheshire WP6 4TY")
Out[2]:
[('coller crescent', 'road'),
 ('runcorn', 'city'),
 ('cheshire', 'state_district'),
 ('wp6 4ty', 'postcode')]

In [3]: parse_address("Yellow Gardens,Tunbridge Wells, Kent TN5 4NP")
Out[3]:
[('yellow gardens', 'road'),
 ('tunbridge wells', 'city'),
 ('kent', 'state_district'),
 ('tn5 4np', 'postcode')]

Also I think it will work better with real data.

  • 1
    Hi Sylwek, do you know how I would apply this to the dataframe? – Philip09 Jun 01 '20 at 09:31
  • @Sylwek_Brzeczkowski This is the start of an answer, but the question is about using `pypostal` in `pandas` so you didn't quite get there. – rjurney Oct 17 '21 at 21:28