R Regex for identifying UK postcodes

Question

My question is similar to this, but I'm looking for something R specific. I've got a data.frame of tens of thousands of addresses and need to pull out the postcodes. Postcodes are in the UK and formatted {LETTER_LETTER_DIGIT DIGIT_LETTER_LETTER}, similar to the following:

"8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"

I've used variations of this code with stringr to no avail:

str_extract('^(\\[Gg]\\[Ii]\\[Rr] 0\\[Aa]{2})|(((\\[A-Za-z]\\[0-9]{1,2})|((\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]{1,2})|((\\[AZa-z]\\[0-9]\\[A-Za-z])|(\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]?\\[A-Za-z]))))\\[0-9]\\[A-Za-z]{2})$',alfa$Address)

Why to no avail? What happened? I guess you got no matches due to `^` and `$`. Remove them or replace with `\\b`, and use `str_extract_all`. And swap the arguments, the first one is input, the second one is the regex. And do not escape `[` that is a start of a character class. — Wiktor Stribiżew, Apr 25 '18 at 11:51
@WiktorStribiżew, I'm getting NA's in the first instance. After removing the `^` and `$` and using `str_extract_all` I'm getting `character(0)`. — elliot, Apr 25 '18 at 11:55
Because all `[` are matched as literal `[`. Remove the escapes. Why did you change the regex from the post you linked to? — Wiktor Stribiżew, Apr 25 '18 at 11:56

score 3 · Accepted Answer · answered Apr 25 '18 at 12:00

The ^ and $ anchors tie the match to the start and end of the string. You may wrap the pattern with \b(?:<pattern>)\b to match those codes as whole words (\b is a word boundary). Besides, the character classes are "ruined" since you escaped their opening [ bracket (\[ matches a literal [ char). Also, swap the arguments: the first one is the input, the second one is the regex. Finally, to get all matches rather than only the first one per string, use str_extract_all instead of str_extract. Note that the fixed pattern also has \s? between the two halves of the postcode, since your pattern had no place for the space in "SY1 3GZ".
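The two main points above can be checked in isolation. This is only a minimal sketch: the short pattern used here is a deliberately simplified stand-in (two letters, a digit, a space, a digit, two letters), not the full postcode regex, and the variable names are made up for illustration.

```r
library(stringr)

addr <- "Shrewsbury,\r\nEngland,\r\nSY1 3GZ"

# "\\[A-Z]" is the regex \[A-Z], which looks for the literal text "[A-Z".
escaped_hit <- str_detect(addr, "\\[A-Z]{2}\\[0-9]")   # FALSE
# Without the backslashes, [A-Z] and [0-9] are character classes again.
class_hit <- str_detect(addr, "[A-Z]{2}[0-9]")         # TRUE

# Anchored to the whole string, the simplified pattern cannot match here...
anchored <- str_extract(addr, "^[A-Z]{2}[0-9] [0-9][A-Z]{2}$")     # NA
# ...while word boundaries let it match inside the longer address.
bounded <- str_extract(addr, "\\b[A-Z]{2}[0-9] [0-9][A-Z]{2}\\b")  # "SY1 3GZ"
```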

You may fix the code like this:

library(stringr)
txt <- "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"
pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"
str_extract_all(txt, pattern)
# => [[1]]
#   [1] "SY1 3GZ"
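Since the question is about a data.frame column, here is a sketch of how the same pattern might be applied to one. The data frame below is a made-up stand-in for alfa, and the Postcode column name is an assumption. It uses str_extract rather than str_extract_all on the assumption that each address holds at most one postcode, which gives a plain character vector instead of a list.

```r
library(stringr)

# Hypothetical stand-in for the asker's data frame `alfa`.
alfa <- data.frame(
  Address = c(
    "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ",
    "No postcode on this line"
  ),
  stringsAsFactors = FALSE
)

pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"

# str_extract() is vectorised over the column and returns the first match
# per element, or NA where nothing matches.
alfa$Postcode <- str_extract(alfa$Address, pattern)
```

If some addresses may contain several postcodes, str_extract_all(alfa$Address, pattern) returns a list with one character vector per row instead.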

Thanks for the explanation. Your code works perfectly! — elliot, Apr 25 '18 at 12:04

score 0 · Answer 2 · answered Apr 28 '18 at 01:47

Here is a more readable way (in Perl rather than R), with one branch per postcode layout:

# $e{zip} holds the postcode without the space, $e{zips} with it.
if ($e{locate} =~ /\b([A-Z])([A-Z])([0-9])([A-Z]) ([0-9])([A-Z])([A-Z])\b/) {
    # Layout AA9A 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6.$7;
    $e{zips} = $1.$2.$3.$4.' '.$5.$6.$7;
} elsif ($e{locate} =~ /\b([A-Z])([0-9])([A-Z]) ([0-9])([A-Z])([A-Z])\b/) {
    # Layout A9A 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6;
    $e{zips} = $1.$2.$3.' '.$4.$5.$6;
} elsif ($e{locate} =~ /\b([A-Z])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # Layout A9 9AA
    $e{zip} = $1.$2.$3.$4.$5;
    $e{zips} = $1.$2.' '.$3.$4.$5;
} elsif ($e{locate} =~ /\b([A-Z])([0-9])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # Layout A99 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6;
    $e{zips} = $1.$2.$3.' '.$4.$5.$6;
} elsif ($e{locate} =~ /\b([A-Z])([A-Z])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # Layout AA9 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6;
    $e{zips} = $1.$2.$3.' '.$4.$5.$6;
} elsif ($e{locate} =~ /\b([A-Z])([A-Z])([0-9])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # Layout AA99 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6.$7;
    $e{zips} = $1.$2.$3.$4.' '.$5.$6.$7;
}
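Since the question asks for R, the same idea could be sketched with stringr roughly as follows. This is not a translation of the code above, only an approximation under assumptions: the six layouts are folded into a single pattern, and the names simple_pattern, zip and zips are chosen here to mirror $e{zip} and $e{zips}. Like the code above, it only accepts upper-case letters and exactly one space.

```r
library(stringr)

# One pattern covering the six outward layouts used above
# (A9, A99, A9A, AA9, AA99, AA9A), a single space, then the inward part 9AA.
simple_pattern <- "\\b([A-Z]{1,2}[0-9][A-Z0-9]?) ([0-9][A-Z]{2})\\b"

txt <- "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"

# str_match() returns a matrix: the full match, then one column per group.
m <- str_match(txt, simple_pattern)
zips <- m[, 1]                  # with the space, like $e{zips}
zip  <- paste0(m[, 2], m[, 3])  # without the space, like $e{zip}
```

When nothing matches, str_match returns NA in every column and paste0 would turn that into the text "NANA", so check is.na(zips) before using zip.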
