r Regular expression for extracting UK postcode from an address is not ordered

Question

I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.

Here is my function:

address_to_postcode <- function(addresses) {

  # 1. Convert addresses to upper case
  addresses = toupper(addresses)

  # 2. Regular expression for UK postcodes:
  pcd_regex = "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"

  # 3. Check if a postcode is present in each address or not (return TRUE if present, else FALSE)
  present <- grepl(pcd_regex, addresses)

  # 4. Extract postcodes matching the regular expression for a valid UK postcode
  postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))

  # 5. Return NA where an address does not contain a (valid format) UK postcode
  postcodes_out <- list()
  postcodes_out[present] <- postcodes
  postcodes_out[!present] <- NA

  # 6. Return the results in a vector (should be same length as input vector)
  return(do.call(c, postcodes_out))
}

According to the guidance document, the logic this regular expression looks for is as follows:

"GIR 0AA" OR
One letter followed by either one or two numbers OR
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by either one or two numbers OR
One letter followed by one number and then another letter OR
A two part post code where the first part must be One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that AND
The second part (separated by a space from the first part) must be One number followed by two letters.

A combination of upper and lower case characters is allowed. Note: the length is determined by the regular expression and is between 2 and 8 characters.

My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.

Consider the following example:

> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"

According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however look what happens when I add a 'z':

> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"

... whereas in this case I would expect the output to be NA.

Adding the anchors (for a different use case) doesn't seem to help, as the 'z' is still accepted even though it is in the wrong place:

> grepl("^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE

Two questions:

1. Have I misunderstood the logic of the regular expression, and
2. If not, how can I correct it (i.e. why aren't the specified letter and character ranges exclusive to their position within the regular expression)?

This bit of R code has just got me out of a big hole. Thank you! I don't know the ins and outs of Regex stuff, so I don't fully understand it, but it works, so for now, that's good enough! — Alan, Feb 01 '23 at 12:57

ctwheels · Accepted Answer · 2018-10-04T14:20:39.703

Edit

Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.

Note

Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to R; for example, \b is written as "\\b" inside an R string literal.
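As a minimal sketch of that escaping (the variable names are illustrative, and the raw-string form assumes R 4.0.0 or later), the same raw pattern can be written in two equivalent ways:

```r
# Raw regex as it would appear in a regex tester: \bGIR ?0AA\b
escaped <- "\\bGIR ?0AA\\b"    # ordinary string literal: each backslash is doubled
raw     <- r"(\bGIR ?0AA\b)"   # raw string (assumes R >= 4.0.0): no doubling needed

# Both literals hold exactly the same pattern
identical(escaped, raw)

# perl = TRUE is used so that \b and (?:...) behave as in the raw regex
grepl(escaped, c("GIR 0AA", "XGIR 0AA"), perl = TRUE)
```

The second input is not expected to match because there is no word boundary between the X and the G.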

Issues

You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.

1. The space character

My guess is that when you copied the regular expression from the link you provided it converted the space character into a newline character and you removed it (that's exactly what I did at first). You need to, instead, change it to a space character.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                                here ^

2. Boundaries

You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.

See regex in use here

\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
^^^^^                                                                                                                                                                      ^^^
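To illustrate what the word boundaries change, here is a small R sketch. It uses a simplified postcode pattern chosen only for this demonstration (an assumption, not the government's regex), and the names core and bounded are hypothetical:

```r
# Simplified postcode core, used here only to illustrate the effect of \b
core    <- "[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}"
bounded <- paste0("\\b(?:", core, ")\\b")

address <- "1A noplace road, random city, NZ1 2PK, UK"

# Without boundaries the match is free to start mid-token, at the 'Z'
unbounded_match <- regmatches(address, regexpr(core, address, perl = TRUE))
unbounded_match   # "Z1 2PK"

# With boundaries the match would have to start at the 'N', and 'NZ' is not allowed
bounded_hit <- grepl(bounded, address, perl = TRUE)
bounded_hit       # FALSE
```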

3. Character class oversight

There's a missing - in the character class as pointed out by @deadcrab in his answer here.

\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
                                                                                           ^
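The effect of the missing hyphen is easy to check in isolation. The sketch below (with a made-up test vector) compares the mistyped class with the intended one:

```r
candidates <- c("A", "N", "Z", "n")

# Mistyped class: matches only 'A', 'Z', or any lowercase letter
typo_hits  <- grepl("^[AZa-z]$", candidates, perl = TRUE)
typo_hits    # TRUE FALSE TRUE TRUE

# Intended class: any ASCII letter
fixed_hits <- grepl("^[A-Za-z]$", candidates, perl = TRUE)
fixed_hits   # TRUE TRUE TRUE TRUE
```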

4. They made the wrong character class optional!

In the documentation it clearly states:

A two part post code where the first part must be:

One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that

They made the wrong character class optional!

\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
                                                                                                                                        ^^^^^^
                                                                                                                        it should be this one ^^^^^^^^
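A quick way to see the difference is to test that one alternative on its own. In this sketch the anchors and the sample first parts are assumptions added purely for the demonstration:

```r
first_parts <- c("EC1A", "NR1", "NRA")

# As published: the digit is optional and the trailing letter is required
flawed <- "^[A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]$"
# As the documentation describes: the digit is required, the trailing letter optional
fixed  <- "^[A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?$"

flawed_hits <- grepl(flawed, first_parts, perl = TRUE)
flawed_hits   # TRUE FALSE TRUE

fixed_hits <- grepl(fixed, first_parts, perl = TRUE)
fixed_hits    # TRUE TRUE FALSE
```

The published form accepts "NRA", which contains no digit at all, while the described form rejects it.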

5. The whole thing is just awful...

There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.

\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b

Answer

As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:

\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
                                             ^^                             ^^

You may also shorten the regex above to the following and use a case-insensitive flag (in R, ignore.case = TRUE in base functions such as grepl() and regexpr(), or regex(pattern, ignore_case = TRUE) in stringr, depending on the method used):

\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b
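For reference, this is one way the shortened pattern could be dropped into a base R extraction function. It is a sketch rather than a definitive implementation: the function name extract_postcode is hypothetical, and perl = TRUE is an assumption made so that \b and (?:...) are handled by the PCRE engine.

```r
extract_postcode <- function(addresses) {
  # Shortened pattern from above, with backslashes doubled for an R string
  pattern <- "\\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\\b"

  # Position of the first match per address (-1 means no match)
  m <- regexpr(pattern, addresses, perl = TRUE, ignore.case = TRUE)
  hit <- !is.na(m) & m > -1L

  # Start with NA everywhere, then fill in the addresses that matched
  out <- rep(NA_character_, length(addresses))
  out[hit] <- regmatches(addresses, m)
  out
}

extract_postcode(c(
  "1A noplace road, random city, NR1 2PK, UK",
  "1A noplace road, random city, NZ1 2PK, UK",
  "flat 2, NR12PK",
  "NR1 bla 2PK"
))
# "NR1 2PK" NA "NR12PK" NA
```

The result always has the same length as the input, and matched postcodes keep whatever case they had in the address because the pattern is applied case-insensitively instead of upper-casing the input first.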

Note

Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.

The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):

- British Overseas Territories
- British Forces Post Office
  - Although they've recently changed it to align with the British postcode system to BF, followed by a number (starting with BF1), they're considered optional alternative postcodes
- Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)

See this regex in use here.

\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b
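A short R check of the extended pattern, assuming perl = TRUE and a handful of made-up sample strings, might look like this (the pattern is split across paste0() only to keep the lines readable):

```r
extended <- paste0(
  "\\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA)",
  " ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}",
  "|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}",
  "|KY[1-3][ -]?[0-2][0-9]{3})\\b"
)

# Illustrative inputs: three special formats, one ordinary postcode, one invalid
samples <- c("ASCN 1ZZ", "BFPO 123", "SAN TA1", "NR1 2PK", "NZ1 2PK")

extended_hits <- grepl(extended, samples, perl = TRUE, ignore.case = TRUE)
extended_hits   # TRUE TRUE TRUE TRUE FALSE
```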

I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).

Side note

The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.

As mentioned in the Issues section, they should have:

Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. As it stands, their regular expression will fail in some cases, as seen here.
Included the - that is missing from one of the character classes.
Made the correct character class optional.
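The anchoring problem comes down to operator precedence: alternation binds more loosely than the anchors, so ^ applies only to the first alternative and $ only to the last. A toy R sketch (the AB and CD patterns are placeholders, not postcodes) shows the difference:

```r
inputs <- c("AB", "CD", "ABxx", "xxCD")

# Anchors attach to the nearest alternative only: (^AB) or (CD$)
loose  <- grepl("^(AB)|(CD)$", inputs, perl = TRUE)
loose    # TRUE TRUE TRUE TRUE

# Grouping the alternatives first makes both anchors apply to each of them
strict <- grepl("^(?:AB|CD)$", inputs, perl = TRUE)
strict   # TRUE TRUE FALSE FALSE
```

This is the same reason the anchored government regex still accepts "NZ1 2PK": the second alternative is anchored only at the end, so it can match "Z1 2PK".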

@ctwheels in some of my addresses the space between the two parts of the postcode is missing (so removing that was deliberate) but thanks for the boundary tip - I will try that. — Amy M, Aug 13 '18 at 19:04
@AmyM place a `?` after the space character. I've edited my answer to include it. — ctwheels, Aug 13 '18 at 19:10
@AmyM I discovered another oversight in the regex. Please see my edit to ensure you update the regex in your program. — ctwheels, Aug 14 '18 at 14:30
@ctwheels thanks for catching that - I noticed there are still a couple of instances where it is incorrectly evaluating e.g. `NR1 bla 2PK` will result in `bla 2PK` being extracted as the postcode. As a side note, which character is used for the case insensitive flag (I thought it was 'i' but couldn't see that in your code above?) — Amy M, Aug 14 '18 at 15:40
@AmyM It depends how you're implementing this pattern in your code, but it'll likely be `ignore.case(pattern)` or `ignore_case = TRUE`. I'm going to review this whole regex in a moment, it seems the UK government really does not know how to develop a proper regular expression. This thing seems pretty broken. I will look at the requirements in full and test it and get back to you. — ctwheels, Aug 14 '18 at 16:20
@AmyM I've now edited my answer again. It includes a lot more information about UK postcodes than it did originally and finds yet another issue! I've rewritten the regex to work properly and added an extra bonus regex for some edge cases. — ctwheels, Aug 14 '18 at 17:56

score 1 · Answer 2 · answered Nov 17 '21 at 15:52

Here is my regular expression (Python):

import re

txt = "0288, Bishopsgate, London Borough of Tower Hamlets, London, Greater London, England, EC2M 4QP, United Kingdom"
matches = re.findall(r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}', txt)
print(matches)  # ['EC2M 4QP']

Golden Lion


Could you explain a bit more about the approach you've taken here? I can see that your regex is far shorter than the one in the accepted answer, but the accepted answer has also been designed to work with a lot of complicated edge cases - does yours handle these as well? – Amy M Nov 26 '21 at 16:50
One or two alpha characters, one numeric, an optional alpha or numeric, one space, one numeric, then two alpha characters restricted by ranges. – Golden Lion Nov 27 '21 at 05:20
