I'm grabbing information from Wikipedia on Canadian Forward Sortation Areas (FSAs - those are the first 3 digits of postal codes in Canada) and what cities/areas they belong to. Example of this information is below:
library(rvest)
library(tidyverse)
URL <- paste0("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_", "K")
FSAs <- URL %>%
read_html() %>%
html_nodes(xpath = "//td") %>%
html_text()
head(FSAs)
[1] "K1AGovernment of CanadaOttawa and Gatineau offices (partly in QC)\n" "K2AOttawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
[3] "K4AOttawa(Fallingbrook)\n" "K6AHawkesbury\n"
[5] "K7ASmiths Falls\n" "K8APembrokeCentral and northern subdivisions\n"
The problem I'm facing is that I would like to have a data frame with the first 3 digits of each spring in one column, and the rest of the information in another. I've thought there would be a solution involving a stringr
function like str_split()
, but this removes the pattern of the first 3 digits, which I of course don't want. In effect, I'm looking to split the string in-between the 3rd and 4th character of each string.
I've figured out this solution, with the last bit borrowed from this answer, but it's incredibly hackish. My question is, is there a better way of doing this?
FSAs %>%
enframe(name = NULL) %>%
separate(value, c(NA, "Location"), sep = "^...", remove = FALSE) %>%
separate(value, c("FSA", NA), sep = "(?<=\\G...)")
# A tibble: 195 x 2
FSA Location
<chr> <chr>
1 K1A "Government of CanadaOttawa and Gatineau offices (partly in QC)\n"
2 K2A "Ottawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
3 K4A "Ottawa(Fallingbrook)\n"
4 K6A "Hawkesbury\n"
5 K7A "Smiths Falls\n"
6 K8A "PembrokeCentral and northern subdivisions\n"
7 K9A "Cobourg\n"
8 K1B "Ottawa(Blackburn Hamlet / Pine View / Sheffield Glen)\n"
9 K2B "Ottawa(Britannia /Whitehaven / Bayshore / Pinecrest)\n"
10 K4B "Ottawa(Navan)\n"