Regex to extract inconsistent postal codes from string

Question

Using the solution posted here, I'm looking to extract postal codes from a list of irregular data in Excel.

Below is a sample of what my data looks like:

Brampton L6P 2G9 ON Canada

M5B2R3 Toronto ON

Toronto M5J 0A6 ON Canada

M1H1T7 Canada

Toronto M4P1T8 ON Canada

MISSISUAGABRAMPTON L5M6S6 ON Canada

333 Sea Ray Inisfil l4e2y6 ON Canada

To call the function, I'm using the following formula

=RegexExtract(A1,"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$")

However the function is not working for me. I think I need to tweak my regex expression in some way but I don't know what I'm missing.

IIR, VBA's regex flavor doesn't support negative look-ahead assertions, and you would also need to remove the anchors with the linked function. Are you trying to *validate* the postal codes, or just extract them? — Comintern, Oct 05 '18 at 02:22
I’m trying to extract them from wherever they are located in the text — Jennifer E, Oct 05 '18 at 03:08

score 4 · Answer 1 · 2018-10-05T03:18:38.380

4

google-spreadsheet

Try,

=REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d")
'alternate
=left(REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d"), 3)&" "&right(REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d"), 3)

edited Oct 05 '18 at 03:18

answered Oct 05 '18 at 03:09

This looks like an excellent solution however I don't think it will work in Microsoft Excel – Jennifer E Oct 05 '18 at 03:19
2

excel doesn't have a RegexExtract function so I thought I would include a google-spreadsheet tag. The pattern is really what you're looking for. – Oct 05 '18 at 03:22
Good answer there.+ – QHarr Oct 05 '18 at 06:35

Comintern · Answer 2 · 2018-10-05T03:08:51.427

You have 2 issues.

First, the expression - if you need to extract the postal code, you can't anchor your regex with ^ and $. The first means "match must occur at the start of the string" and the second means "match must end at the end of the string". This is only useful if you are validating a postal code, but it obviously can't be used to extract one from your examples because they all contain things other than the postal code. The other problem with the regex is the negative look-ahead assertion (?!.*[DFIOQU]), which means "no match can contain the letters D, F, I, O, Q, or U". To the best of my recollection, this isn't supported in VBScript regex. If I'm mistaken, please correct me in the comments.

That gives you the slightly more pedantic expression:

[ABCEGHJKLMNPRSTVX]\d[ABCEGHJKLMNPRSTVWXYZ][ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d

I took the liberty of optionally allowing a - between the FSA and LDU because I see that a lot, particularly from non-Canadians.

Second, the function that you're calling (copied below from the linked answer):

Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String

RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)

For i = 0 To allMatches.count - 1
    For j = 0 To allMatches.Item(i).submatches.count - 1
        result = result & (separator & allMatches.Item(i).submatches.Item(j))
    Next
Next

If Len(result) <> 0 Then
    result = Right$(result, Len(result) - Len(separator))
End If

RegexExtract = result

End Function

The first problem is that it is case sensitive. It is also tailored to extracting submatches, which you don't care about - your examples are looking for a single match.

I'd go with this much simpler option that also correctly formats the output:

Public Function ExtractCanadianPostalCode(inputText As String) As String
    With CreateObject("vbscript.regexp")
        .Pattern = "[ABCEGHJKLMNPRSTVX]\d[ABCEGHJKLMNPRSTVWXYZ][ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d"
        .IgnoreCase = True
        If .Test(inputText) Then
            Dim matches As Object
            Set matches = .Execute(inputText)
            ExtractCanadianPostalCode = UCase$(Left$(matches(0), 3) & " " & Right$(matches(0), 3))
        End If
    End With
End Function

`"[A-X]\d[A-Z] ?\d[A-Z]\d"` and while I've never actually seen a hyphen in a Canadian postal code, I suppose there are enough Americans in the world to include the possibility. — , Oct 05 '18 at 03:07
@Jeeped TBH, if they didn't need to be valid codes I'd probably use "[A-Z](\d|O)[A-Z][ -]?(\d|O)[A-Z](\d|O)". Using an 'O' for a zero is another *really* common error. — Comintern, Oct 05 '18 at 03:10
Yes, I have seen OCR convert zeroes to oh's. That's a very valid point. — , Oct 05 '18 at 03:12
Thank you both for your suggestions. These solutions worked perfectly for me. — Jennifer E, Oct 05 '18 at 03:35

Regex to extract inconsistent postal codes from string

2 Answers2