1

I am trying to format UK postcodes in R that arrive as a character vector of inconsistently formatted inputs.

For example, I have the following postcodes:

postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ", "KY1.1UU", "KY4    9RW", "G32-7EJ")

How do I write generic code that converts the entries of the above vector into:

c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")

That is, the first part of the postcode is separated from the second part by a single space, and all letters are uppercase.

EDIT: the second part of the postcode is always the last 3 characters (a digit followed by two letters).

Cinnamon
  • In general it seems to be non-trivial, but [this answer](https://stackoverflow.com/questions/164979/regex-for-matching-uk-postcodes/51885364#51885364) may be useful – Miff Oct 06 '21 at 10:32
  • @RonakShah - the second part is always 3 characters. So one can assume that 3 last numbers and letters form the second part. – Cinnamon Oct 06 '21 at 10:32

4 Answers

3

I couldn't come up with a smart regex solution so here is a split-apply-combine approach.

sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
  paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})

#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU"  "KY4 9RW"  "G32 7EJ" 

The logic here is to insert a `:` (or any character that does not occur in the data) between the first and second parts of each string, split the string on `:`, strip the unwanted characters from each part, convert to uppercase, and paste the two parts back together with a single space.
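
To make each stage visible, the same logic can be unrolled into separate steps. This is only a sketch of the one-liner above: the intermediate names `marked`, `parts` and `formatted` are illustrative, and `vapply` is swapped in for `sapply` so the return type is explicit. It assumes R 3.6.0 or later, where `trimws()` accepts a `whitespace` argument.

```r
postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ", "KY1.1UU", "KY4    9RW", "G32-7EJ")

# Step 1: mark the boundary before the last three characters with ":".
# e.g. "IV408BU" becomes "IV40:8BU" and "kY11..4hJ" becomes "kY11..:4hJ".
marked <- sub("^(.*?)(...)$", "\\1:\\2", postcodes)

# Step 2: split each string on the literal ":" into its two parts.
parts <- strsplit(marked, ":", fixed = TRUE)

# Step 3: strip dots, whitespace and hyphens from both ends of each part,
# convert to uppercase, and join the two parts with a single space.
formatted <- vapply(parts, function(x) {
  paste(toupper(trimws(x, whitespace = "[.\\s-]")), collapse = " ")
}, character(1))

print(formatted)
```

Because the split is made on the last three characters, any trailing junk (for example a trailing space) would end up in the second part, so trimming the input first is a sensible precaution.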

Ronak Shah
  • This is perfect, thank you. Could I also ask you - how do I adjust the code to include formatting of postcodes that have a space before the first element, i.e. how do I make " G48 1PG" into "G48 1PG" within your solution? – Cinnamon Oct 06 '21 at 10:48
  • 1
    `trimws(postcodes)` should remove those spaces. – Ronak Shah Oct 06 '21 at 10:49
2

One approach:

  1. Convert to uppercase

  2. Extract the alphanumeric characters

  3. Paste back together with a space before the last three characters

The code would then be:

library(stringr)

postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4    9RW","G32-7EJ")

postcodes <- str_to_upper(postcodes)
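# Keep only the alphanumeric characters, then paste everything except the
# last three characters and the last three characters together with a space.
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x) {
  paste(paste0(head(x, -3), collapse = ""), paste0(tail(x, 3), collapse = ""))
})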
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU"  "KY4 9RW"  "G32 7EJ"
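
As a small illustration of the `head`/`tail` step on a single postcode, here is a base R sketch. It assumes the string is already uppercase and free of separators, so `strsplit` stands in for `str_extract_all`; the names `chars`, `outward` and `inward` are made up for the example.

```r
# Split one cleaned postcode into a vector of single characters.
chars <- strsplit("KY114HJ", "")[[1]]

# head(x, -3) drops the last three elements; tail(x, 3) keeps only those three.
outward <- paste0(head(chars, -3), collapse = "")
inward  <- paste0(tail(chars, 3), collapse = "")

result <- paste(outward, inward)
print(result)  # "KY11 4HJ"
```

The negative count in `head(chars, -3)` is what lets the first part vary in length while the second part stays fixed at three characters.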
Miff
  • The `head/tail` construct is really good, especially `head(x, -3)`, and the `'[:alnum:]'` handles fringe cases of leading space as asked below. – Chris Oct 06 '21 at 11:12
2

You can remove everything that is not a word character with `\\W` (or `[^[:alnum:]_]`) and then insert a space before the last 3 characters with `(.{3})$` and `\\1`.

sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU"  "KY4 9RW"  "G32 7EJ" 
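
If this is needed in several places, the one-liner could be wrapped in a small helper. The function name `format_postcode` is hypothetical, and the sketch assumes every input has at least three characters left after cleaning. Since all non-word characters are removed before the space is inserted, leading and trailing spaces are handled as well; note that `\\W` does not remove underscores.

```r
# Hypothetical helper wrapping the approach above.
format_postcode <- function(x) {
  # Drop every non-word character (spaces, dots, hyphens) and uppercase.
  cleaned <- toupper(gsub("\\W+", "", x))
  # Insert a single space before the last three characters.
  sub("(.{3})$", " \\1", cleaned)
}

out <- format_postcode(c(" G48 1PG", "g481pg ", "G48-1PG"))
print(out)  # "G48 1PG" "G48 1PG" "G48 1PG"
```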
GKi
1
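# Option 1 using regex:
# Uppercase the input and collapse each run of non-word characters to one
# space, then insert a space before the trailing digit-plus-letters group
# wherever no separator is present.
res1 <- gsub(
  "(\\w+)(\\d[[:upper:]]\\w+$)",
  "\\1 \\2",
  gsub("\\W+", " ", toupper(postcodes))
)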

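# Option 2 using substrings:
# Uppercase the input and collapse each run of non-word characters to one
# space, then paste everything before the last three characters to the
# last three characters.
res2 <- vapply(
  trimws(gsub("\\W+", " ", toupper(postcodes))),
  function(ir) {
    paste(
      trimws(substr(ir, 1, nchar(ir) - 3)),
      substr(ir, nchar(ir) - 2, nchar(ir))
    )
  },
  character(1),
  USE.NAMES = FALSE
)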
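
Since `substr()` and `nchar()` are vectorised, the substring idea can also be sketched without `vapply`. This is a variant rather than the code above: it removes the separators entirely instead of collapsing them to a space, and the names `cleaned`, `n` and `res3` are illustrative.

```r
postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ", "KY1.1UU", "KY4    9RW", "G32-7EJ")

# Remove all non-word characters and uppercase in one pass.
cleaned <- toupper(gsub("\\W+", "", postcodes))

# Per-element string lengths; substr() recycles start and stop over the vector.
n <- nchar(cleaned)

# Everything up to the last three characters, a space, then the last three.
res3 <- paste(substr(cleaned, 1, n - 3), substr(cleaned, n - 2, n))
print(res3)
```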
hello_friend