0

I have a pandas dataframe looking like:

Name       Address
Alan       23 Belby road, home near me 71234
Tom        PA23 6NH brickby avenue
Solty      7 solty road 7123-234
Ben        Nowhere road 713456 Belgium

I want to get the post codes with resultant dataframe looking like this:

Name       Address                               Postcode
Alan       23 Belby road, home near me 71234     71234
Tom        PA23 6NH brickby avenue               PA23 6NH
Solty      7 solty road 7123-234                 7123-234
Ben        Nowhere road 713456 Belgium           713456

I looked at the posts at Python, Regular Expression Postcode search and python - get zipcode from full address

Unclear on how to proceed.

Mikee
  • 783
  • 1
  • 6
  • 18

1 Answers1

0

Each capture group can be specified in a re expression separated by |

Extract all pattern matches into individual columns (See. Multiple Pattern using Regex in Pandas)

Then try to coerce all matches into the first column using bfill (See. How to collapse multiple columns into one in pandas)

Then merge back into the original data set.

import pandas as pd

postcode_re = r'([Gg][Ii][Rr] 0[Aa]{2})|' \
              r'((([A-Za-z][0-9]{1,2})|' \
              r'(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|' \
              r'(([A-Za-z][0-9][A-Za-z])|' \
              r'([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})|' \
              r'(\d{5}\-?\d{0,4})|' \
              r'(\d{4}\-?\d{0,3})'

df = pd.DataFrame({'Name': {0: 'Alan', 1: 'Tom',
                            2: 'Solty', 3: 'Ben',
                            4: 'Mary', 5: 'Mike'},
                   'Address': {0: 'PA23 6NH brickby avenue',
                               1: '818 mention 560100',
                               2: 'calculate AB24 EFT',
                               3: '818 where 560100',
                               4: 'Nowhere road 713456 Belgium',
                               5: '7 solty road 7123-234'}})

df = df.merge(df['Address']
              .str
              .extractall(postcode_re)
              .bfill(axis=1)[0]
              .droplevel(level=1)
              .rename('Postcode'),
              left_index=True,
              right_index=True,
              how='left')

print(df.to_string())

Output:

    Name                      Address  Postcode
0   Alan      PA23 6NH brickby avenue  PA23 6NH
1    Tom           818 mention 560100    560100
2  Solty           calculate AB24 EFT       NaN
3    Ben             818 where 560100    560100
4   Mary  Nowhere road 713456 Belgium    713456
5   Mike        7 solty road 7123-234  7123-234

If you need more post code regexes see postal-codes.

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • why are you looking at all rows and 1st column with ```.iloc[:, 0]``` ? My interest is only on each row corresponding to a particular name. – Mikee Apr 26 '21 at 07:09
  • I’m confused by your wording “my interest is only on each row corresponding to a particular name”. Your question asks “how to extract numeric and alphanumeric post codes fun addresses”. Extractall generates a column for each capture group, if there’s a match, the match goes into the respective column, if there’s not a match, there’s a null in the column. The resulting frame from extractall in this case is 3 columns wide. But your sample output only has 1 new column “Postcode”. Hence bfill and `.iloc[:, 0]` to turn 3 columns with nulls into 1 column with as few nulls as possible. – Henry Ecker Apr 26 '21 at 12:03
  • I asked because I was getting incorrect results and lots of nulls in the Postcode column when I tried it on a larger dataset so I need to get an understanding of what was happening under the hood. – Mikee Apr 26 '21 at 12:10
  • You need to add any patterns to that regex that you want to match. If there isn’t a match you’ll get a null. There’s a link at the bottom for postal code regex by country that may be helpful. Also more specific patterns need to be first since you’ll only get the first non null value going across. You could alternatively collapse the columns into a list of possible matches depending, again, on exactly what types of post codes you’re trying to match. – Henry Ecker Apr 26 '21 at 12:26
  • This is dataset is an example of getting incorrect results. ``` df = pd.DataFrame({'Name': {0: 'Alan', 1: 'Tom', 2: 'Solty', 3: 'Ben'}, 'Address': {0: 'PA23 6NH brickby avenue', 1: '818 mention 560100', 2: 'calculate AB24 EFT', 3: '818 where 560100'}})``` – Mikee Apr 26 '21 at 12:37
  • I would expect the ```NaN``` to be where the regex does not match but in the example provided, it messes up the result of the Postcode column. – Mikee Apr 26 '21 at 12:39
  • 1
    Got it. I made some false assumptions initially. I've updated the implementation to use the [regex for the whole uk](https://stackoverflow.com/a/164994/15497888) and fixed some of those false assumptions. From my short searching it appears that `AB24 EFT` is not a valid postcode. For `AB24` group post codes, the second group _must_ start with a number. Which is why you're getting a null in that column. – Henry Ecker Apr 26 '21 at 13:03