
My question is similar to this, but I'm looking for something R-specific. I have a data.frame with tens of thousands of addresses and need to pull out the postcodes. The postcodes are UK ones, formatted like {LETTER_LETTER_DIGIT DIGIT_LETTER_LETTER}, as in the following:

"8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"

I've used variations of this code with stringr to no avail:

str_extract('^(\\[Gg]\\[Ii]\\[Rr] 0\\[Aa]{2})|(((\\[A-Za-z]\\[0-9]{1,2})|((\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]{1,2})|((\\[AZa-z]\\[0-9]\\[A-Za-z])|(\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]?\\[A-Za-z]))))\\[0-9]\\[A-Za-z]{2})$',alfa$Address)
elliot
  • Why to no avail? What happened? I guess you got no matches due to `^` and `$`. Remove them or replace with `\\b`, and use `str_extract_all`. And swap the arguments, the first one is input, the second one is the regex. And do not escape `[` that is a start of a character class. – Wiktor Stribiżew Apr 25 '18 at 11:51
  • Why do you have double-backslashes? – Hugh Apr 25 '18 at 11:55
  • @WiktorStribiżew, I'm getting NA's in the first instance. After removing the `^` and `$` and using `str_extract_all` I'm getting `character(0)`. – elliot Apr 25 '18 at 11:55
  • Because all `[` are matched as literal `[`. Remove the escapes. Why did you change the regex from the post you linked to? – Wiktor Stribiżew Apr 25 '18 at 11:56

2 Answers


The ^ and $ anchors require the pattern to match the whole string. Instead, wrap the pattern as \b(?:<pattern>)\b to match the codes as whole words (\b is a word boundary). The character classes are also "ruined" because you escaped their opening [ bracket (\[ matches a literal [ char). In addition, swap the arguments: the first one is the input, the second one is the regex. Finally, to get all matches rather than only the first, use str_extract_all instead of str_extract.

You may fix the code like this:

library(stringr)
txt <- "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"
pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"
str_extract_all(txt, pattern)
# => [[1]]
#   [1] "SY1 3GZ"
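Since the question is about a whole data.frame column, here is a minimal sketch of applying the same pattern to every row. It uses base R (`regexpr` and `regmatches` with `perl = TRUE`) so that it runs without extra packages, and it keeps only the first match per address. The sample rows and the new `Postcode` column are assumptions made for illustration; only `alfa$Address` comes from the question.

```r
# Hypothetical sample data; the name alfa$Address follows the question.
alfa <- data.frame(
  Address = c(
    "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ",
    "10 Downing Street,\r\nLondon,\r\nSW1A 2AA",
    "No postcode here"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer above.
pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"

# regexpr() locates the first match in each element; -1 means "no match".
m <- regexpr(pattern, alfa$Address, perl = TRUE)

# regmatches() returns only the matched substrings, so fill the column
# with NA first and assign the matches to the rows that had one.
alfa$Postcode <- NA_character_
alfa$Postcode[m != -1] <- regmatches(alfa$Address, m)

print(alfa$Postcode)
```

With stringr loaded, `str_extract(alfa$Address, pattern)` should give the same one-value-per-row result, returning NA for rows without a match.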
Wiktor Stribiżew

Here is a more readable way (in Perl), with one pattern per postcode shape:

if ($e{locate} =~ /\b([A-Z])([A-Z])([0-9])([A-Z]) ([0-9])([A-Z])([A-Z])\b/) {
    # shape: AA9A 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6.$7;
    $e{zips} = $1.$2.$3.$4.' '.$5.$6.$7;
} elsif ($e{locate} =~ /\b([A-Z])([0-9])([A-Z]) ([0-9])([A-Z])([A-Z])\b/) {
    # shape: A9A 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6;
    $e{zips} = $1.$2.$3.' '.$4.$5.$6;
} elsif ($e{locate} =~ /\b([A-Z])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # shape: A9 9AA
    $e{zip} = $1.$2.$3.$4.$5;
    $e{zips} = $1.$2.' '.$3.$4.$5;
} elsif ($e{locate} =~ /\b([A-Z])([0-9])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # shape: A99 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6;
    $e{zips} = $1.$2.$3.' '.$4.$5.$6;
} elsif ($e{locate} =~ /\b([A-Z])([A-Z])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # shape: AA9 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6;
    $e{zips} = $1.$2.$3.' '.$4.$5.$6;
} elsif ($e{locate} =~ /\b([A-Z])([A-Z])([0-9])([0-9]) ([0-9])([A-Z])([A-Z])\b/) {
    # shape: AA99 9AA
    $e{zip} = $1.$2.$3.$4.$5.$6.$7;
    $e{zips} = $1.$2.$3.$4.' '.$5.$6.$7;
}
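Because the question asks for R, the same shape-by-shape idea can be sketched in base R. The six outward-code shapes below are read off the Perl branches above (A = uppercase letter, 9 = digit). The names `outward_shapes`, `shape_to_regex` and `parse_postcode` are hypothetical helpers invented for this illustration, and the `zip`/`zips` columns mirror the Perl hash keys. Like the Perl code, this sketch only matches uppercase postcodes with exactly one space.

```r
# Outward-code shapes taken from the six Perl branches (A = letter, 9 = digit).
outward_shapes <- c("AA9A", "A9A", "A9", "A99", "AA9", "AA99")

# Turn a shape such as "AA9A" into the regex "[A-Z][A-Z][0-9][A-Z]".
shape_to_regex <- function(shape) {
  chars <- strsplit(shape, "")[[1]]
  paste0(ifelse(chars == "A", "[A-Z]", "[0-9]"), collapse = "")
}

# Group 1: any outward shape. Group 2: the inward code (digit + two letters).
outward <- paste(vapply(outward_shapes, shape_to_regex, character(1)),
                 collapse = "|")
pattern <- paste0("\\b(", outward, ") ([0-9][A-Z]{2})\\b")

# Hypothetical helper: returns the postcode without (zip) and with (zips)
# the space, or NA where no postcode was found.
parse_postcode <- function(x) {
  parts <- regmatches(x, regexec(pattern, x, perl = TRUE))
  pick <- function(i) {
    vapply(parts,
           function(p) if (length(p) == 3) p[i] else NA_character_,
           character(1))
  }
  out <- pick(2)
  inw <- pick(3)
  data.frame(
    zip  = ifelse(is.na(out), NA_character_, paste0(out, inw)),
    zips = ifelse(is.na(out), NA_character_, paste(out, inw)),
    stringsAsFactors = FALSE
  )
}

res <- parse_postcode(c(
  "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ",
  "London,\r\nSW1A 2AA",
  "nothing to find"
))
print(res)
```

Listing the shapes as data keeps each one readable while avoiding six near-identical branches; adding a shape means adding one string to `outward_shapes`.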
Ervin Ruci