I have a DataFrame that contains information, in particular addresses like the DataFrame below:
col1 col2
0 1303 674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP
1 1205 154 Coller Crescent Runcorn,Cheshire WP6 4TY
2 1504 122 Uphill Road,Rayleigh, Essex SF6 9VT
3 1678 67 Lampoon Crescent,Billericay, Essex, CM52 0QY
4 1897 32 Dovelane,Benfleet, Essex, PT7 6WA
5 1654 46, The Clewter,Great Durham, Essex, CD7 9HE
These are all different formats, with some having commas and others not, and there are also examples of addresses from other countries. I was wondering how would I extract the addresses from here as I want to merge them with location data.
This could mean a merge on substring or just extracting the
I have tried:
df["postcodes"] = df["address"].str.extract(r'^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$')
to extract the postcodes but this doesn't seem to work coming up with the error of 9 arguments given where 1 should be taken.
I have also attempted:
rhs = (df1.address
.apply(lambda x: df2[df2.Postcode.str.find(x).ge(0)]['location'])
.bfill(axis=1)
.iloc[:, 0])
(pd.concat([df1.app_nbr, rhs], axis=1, ignore_index=True)
.rename(columns={0: 'app_nbr', 1: 'location'}))
from here: How to merge pandas on string contains? but it is taking a long time to run on my machine given there are 1.7million postcodes in the second dataframe to match on.
The expected output would be either:
col1 col2 col3
0 1303 674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP TN5 4NP
1 1205 154 Coller Crescent Runcorn,Cheshire WP6 4TY WP6 4TY
2 1504 122 Uphill Road,Rayleigh, Essex SF6 9VT SF6 9VT
3 1678 67 Lampoon Crescent,Billericay, Essex, CM52 0QY CM52 0QY
4 1897 32 Dovelane,Benfleet, Essex, PT7 6WA PT7 6WA
5 1654 46, The Clewter,Great Durham, Essex, CD7 9HE CD7 9HE
Or (matching with the second dataframe based on postcode):
col1 col2 col3 (coords)
0 1303 674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP 50.00, 1.00
1 1205 154 Coller Crescent Runcorn,Cheshire WP6 4TY 51.23, 1.05
2 1504 122 Uphill Road,Rayleigh, Essex SF6 9VT 54.65, 1.07
3 1678 67 Lampoon Crescent,Billericay, Essex, CM52 0QY 51.23, 0.95
4 1897 32 Dovelane,Benfleet, Essex, PT7 6WA 54.6, 2.23
5 1654 46, The Clewter,Great Durham, Essex, CD7 9HE 49.25, 1.23
Any help would be appreciated or to be pointed in the right direction.
Thank you
*addresses have been altered so are not real but the format is the same