Strings in R - insert missing space between selected numeric characters for UK postcode

Question

Linking UK postcodes to their lat/long for a leaflet plot, but a missing space is stopping a few hundred their lat/long when left_join().

Have a df where some of the postcodes are missing the space between the first and second part of the code: should be 2 chr and 1 or 2 num in the first part (ie EH1 or FK15), and the second part 1 num and 2 chr (ie 1AD).

Want to change all the postcodes to the same then will left_join() later to get the lat/long.

Have tried the case_when(), if_else() and gsub() that was mentioned in a similar question on this site, but with no luck.

event1356 %>% 
  mutate(postcode = str_to_upper(postcode),
        postcode = case_when(
          postcode %in% "[0-9]{2}+[A-Z]{2}$" ~ " [0-9]+[A-Z]{2}$",
          TRUE ~ postcode)
           ) %>%
  filter(str_detect(postcode, pattern = "[0-9]{2}+[A-Z]{2}$"))

event1356 %>% 
  mutate(postcode = str_to_upper(postcode),
        postcode = case_when(
          postcode %in% "[0-9]{2}+[A-Z]{2}$" ~ gsub(postcode,
                                      "(^[A-Z]{2}+[0-9])([0-9]{1}+[A-Z]{2}$)", 
                                      "\\1 \\2", postcode),
          TRUE ~ postcode)
           ) %>%
  filter(str_detect(postcode, pattern = "[0-9]{2}+[A-Z]{2}$"))

Both above made no different

event1356 %>% 
  mutate(postcode = str_to_upper(postcode),
        postcode = if_else(str_detect(postcode,
        pattern = "[0-9]+[A-Z]{2}$"),
        gsub("(^[A-Z]{2}+[0-9]{1,})([0-9]{1}+[A-Z]{2}$)", "\\1 \\2"), postcode)
           ) %>% 
  filter(str_detect(postcode, pattern = "[0-9]{2}+[A-Z]{2}$"))

event1356 %>% 
  mutate(postcode = str_to_upper(postcode),
        postcode = if_else(str_detect(postcode,
        pattern = "^[A-Z]{2}+[0-9]{1,}+[0-9]+[A-Z]{2}$"),
        gsub("([0-9])([0-9]+[A-Z]{2}$)", "\\1 \\2"), postcode)
           ) %>% 
  filter(str_detect(postcode, pattern = "[0-9]{2}+[A-Z]{2}$"))

Error: Problem with mutate() column postcode. i postcode = if_else(...). x argument "x" is missing, with no default

event1356 %>% 
  mutate(postcode = str_to_upper(postcode),
        postcode = if_else(str_detect(postcode,
        pattern = "[0-9]{2}+[A-Z]{2}$"),
        gsub(postcode, "(^[A-Z]{2}+[0-9])([0-9]{1}+[A-Z]{2}$)", "\\1 \\2"), postcode)
           ) %>% 
  filter(str_detect(postcode, pattern = "[0-9]{2}+[A-Z]{2}$"))

event1356 %>% 
  mutate(postcode = str_to_upper(postcode),
        postcode = if_else(str_detect(postcode,
        pattern = "^[A-Z]{2}+[0-9]{1,}+[0-9]+[A-Z]{2}$"),
        gsub(postcode, "(^[A-Z]{2}+[0-9])([0-9]{1}+[A-Z]{2}$)", "\\1 \\2"), postcode)
           ) %>% 
  filter(str_detect(postcode, pattern = "[0-9]{2}+[A-Z]{2}$"))

event1356 %>% 
  mutate(postcode = str_to_upper(postcode),
        postcode = if_else(str_detect(postcode,
        pattern = "[0-9]+[A-Z]{2}$"),
        gsub(postcode, "(^[A-Z]{2}+[0-9]{1,})([0-9]{1}+[A-Z]{2}$)", 
             "\\1 \\2", gsub(" ", "", postcode)), postcode)
           ) %>% 
  filter(str_detect(postcode, pattern = "[0-9]{2}+[A-Z]{2}$"))

Warning: Problem with mutate() column postcode. i postcode = if_else(...). i argument 'pattern' has length > 1 and only the first element will be used

Example of the df:

postcode <- c("EH1 1AD", "EH1 1AE", "EH13 5ED", "GA3 9RD", "FK15 8ED", "Fk81tu", "FK159DY", "69005", "FK54UP", "FK10 WTF", "FK94DQ", "FK102ET", "FK159JE", "FK95HQ", "PH20BL")

I think you are overcomplicating this. You cannot have two postcodes which have identical characters but differing numbers of spaces. So just remove the spaces from the postcode in both columns and then join. — SamR, Apr 05 '22 at 18:25

score 0 · Answer 1 · answered Apr 06 '22 at 00:00

0

Wow! When I looked up the methods in which the postal code is documented! The UK government publishes the regex. (Of course, that doesn't translate directly in R.)

This works for about 99% of the cases and assumes only actual post codes are in your list.

ifelse(nchar(postcode) < 6,
       postcode,
       ifelse(str_detect(postcode, "\\s"),
              postcode,
              paste0(substr(postcode, start = 1, stop = 3),
                     " ", substr(postcode, 4,
                                 stop = nchar(postcode)))))

# [1] "EH1 1AD" "EH1 1AE"  "EH13 5ED" "GA3 9RD"  "FK15 8ED" "Fk8 1tu"  "FK1 59DY"
# [8] "69005"   "FK5 4UP"  "FK10 WTF" "FK9 4DQ"  "FK1 02ET" "FK1 59JE" "FK9 5HQ" 
#[15] "PH2 0BL"

answered Apr 06 '22 at 00:00

Kat

15,669
3
18
51

I'd read [this](https://stackoverflow.com/a/51885364/12545041) answer about that regex. I still don't understand though. The problem seems to be simply matching postcodes which are in both left and right data frames, with different numbers of spaces - so removing spaces from both columns seems the easiest solution. – SamR Apr 06 '22 at 08:14
@Kat thanks for that, made a slight change to the code as "Fk10" and "FK15" were placing the space at the wrong point. Their all postcodes just the one from France (69005) and a spelling mistake they hit w rather than 2 from what I can see. – Malcheyne Apr 06 '22 at 19:08
ifelse(nchar(postcode) < 6, - postcode, - ifelse(str_detect(postcode, "\\s"), * postcode, * paste0(substr(postcode, start = 1, stop = (nchar(postcode)-3)), " ", substr(postcode, start = (nchar(postcode)-2), stop = nchar(postcode))))) [1] "EH1 1AD" "EH1 1AE" "EH13 5ED" "GA3 9RD" "FK15 8ED" "Fk8 1tu" [7] "FK15 9DY" "69005" "FK5 4UP" "FK10 WTF" "FK9 4DQ" "FK10 2ET" [13] "FK15 9JE" "FK9 5HQ" "PH2 0BL" – Malcheyne Apr 06 '22 at 19:08
@SamR It's more to stay in the right format as I was going to use the leaflet labels and maybe a plot counting the numbers from each postcode so would have been trying to change them back later. Was thinking of mutating new columns in both data frames as you said, then join on them the fix the 189 out of 1718 for the labels. – Malcheyne Apr 06 '22 at 19:10
Just create a new column with no space and join on that, but use the original column for labels. – SamR Apr 06 '22 at 19:12
Yeah, but I'd have to hard code fix the labels as like `case_when(postcode == "FK102ET" ~ "FK10 2ET"), ...` 189 times for each original column that was not right. – Malcheyne Apr 06 '22 at 19:25

Strings in R - insert missing space between selected numeric characters for UK postcode

1 Answers1