How difficult this is depends on what info you want from the page. I am working on the assumption that you want a dataframe/tibble that runs from the `PointOfCare` (e.g. `1. Hospitals`), through the `ServiceType` (e.g. `Publicly Funded / Free Services`), all the way down to the actual listing details of each service.
There are two immediate problems to overcome if going for all of the above info:

- The DOM is pretty flat, i.e. the `PointOfCare` info is at the same level of the DOM as the `ServiceType`, and the start of the service listings is only one level deeper. This means there is no nice, logical way to use an HTML parser to select parent nodes and then process their children, and still get the desired `PointOfCare` and `ServiceType` mapped to each service listing (see the sketch after this list).
- There are differing numbers of child nodes holding a given service's info (those with className `summaryRecordType`) within each listing, ranging between 3 and 5.
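For a concrete picture of the first problem, here is roughly what the flat structure looks like (a sketch with hypothetical, simplified markup; assuming rvest is loaded):

library(rvest)
flat <- minimal_html('
  <div>
    <div class="classyHeading">1. Hospitals</div>
    <div>Publicly Funded / Free Services</div>
    <table class="condensedViewTable"></table>
    <div class="classyHeading">2. Inpatient services</div>
    <div>Publicly Funded / Free Services</div>
    <table class="condensedViewTable"></table>
  </div>')
# every listing table comes back in one flat set; nothing ties a table
# to the heading that precedes it
flat %>% html_nodes(".condensedViewTable")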
① To deal with the first problem, I decided to convert the retrieved HTML to a string and split that string into chunks to process. I retrieve the `PointOfCare` labels and use those to generate the initial blocks, `settings_blocks`:
all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)
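In miniature, the chunking idea works like this (toy text and labels only, assuming stringr and purrr are loaded):

txt <- "header 1. Hospitals <listings...> 2. Inpatient services <listings...>"
labels <- c("1. Hospitals", "2. Inpatient services")
starts <- map_int(labels, ~ str_locate(txt, fixed(.x))[, 1]) # where each label begins
ends <- c(starts[-1] - 1, nchar(txt))                        # each block ends where the next begins
map2_chr(starts, ends, ~ substring(txt, .x, .y))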
At this point I have e.g. `1. Hospitals` in the first block, `2. Inpatient services` in the second block, and so on. I then further split each of those chunks by the `ServiceTypes`:
service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses")
Annoyingly, I had to hardcode the `\r\n` in the latter string, rather than retrieve it from the relevant node HTML itself, as it was not present otherwise (and therefore no match was found for the split).
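A whitespace-tolerant regex might sidestep the hardcoding; this is an untested alternative (`block_string` is a hypothetical name for one settings block as text), not what I ran:

service_type_patterns <- c(
  "Publicly Funded / Free Services",
  "Private Practice Professionals and\\s+Commercial Businesses" # \\s+ tolerates \r\n, \n or spaces
)
str_locate_all(block_string, service_type_patterns[2])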
So `1. Hospitals`, when processed, would have only a sub-chunk for `Publicly Funded / Free Services`, whereas `2. Inpatient services` would end up split in two: `Publicly Funded / Free Services` and `Private Practice Professionals and Commercial Businesses`. This all happens in a loop over `settings_blocks`.
for (i in seq_along(settings_blocks)) {
r <- r + 1
point_of_care <- points_of_delivery[[i]]
splits <- split_points(settings_blocks[[i]])
nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))
There are a couple of sections with no listings, e.g. `3.3 Drop-in centres`; in those cases I generate a record as follows:
record <- list(
  point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
)
A full record has the following info fields:
list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages)
Once at the lowest-level block, `nodes_html[[j]]`, provided there are tables (each table is a listing), I retrieve the info for all the fields of interest:
records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))
② Now, we still have the issue of differing amounts of info in each listing table. However, it turns out one can map what info is present according to how many child nodes with className `summaryRecordType` are present. The mapping is as follows:
| Child nodes count | BusinessName | ServiceDescription | Address | Tel | Website | AreaServed |
|-------------------|--------------|--------------------|---------|-----|---------|------------|
| 3 | | | 1 | 2 | | 3 |
| 4 | 1 | | 2 | 3 | | 4 |
| 5 | | 1 | 2 | 3 | 4 | 5 |
From column 2 onwards, the number indicates which node holds the info indicated by the column header. As I loop over each bottom-level chunk, I have a helper function that applies this mapping when retrieving the listing info:
record_from_table <- function(table, point_of_care, service_type) {
info_lines <- table %>% html_nodes(".summaryRecordType")
Title <- table %>%
html_node("a") %>%
html_text() %>%
trimws()
Url <- table %>%
html_node("a") %>%
html_attr("href") %>%
url_absolute(link)
if (length(info_lines) == 3) {
BusinessName <- NA_character_
ServiceDescription <- NA_character_
Address <- info_lines[1] %>%
html_text() %>%
trimws() # ... and so on for the 4- and 5-node cases (full function in the complete code below)
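As an aside, the same mapping could be expressed table-driven rather than as if/else branches. This is only a sketch reusing the `info_lines` from above, not the code I actually ran:

field_positions <- list(
  `3` = c(Address = 1, Tel = 2, AreaServed = 3),
  `4` = c(BusinessName = 1, Address = 2, Tel = 3, AreaServed = 4),
  `5` = c(ServiceDescription = 1, Address = 2, Tel = 3, Website = 4, AreaServed = 5)
)
positions <- field_positions[[as.character(length(info_lines))]]
# named character vector of the fields present; absent fields stay NA_character_
values <- map_chr(positions, ~ info_lines[.x] %>% html_text() %>% trimws())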
I pass in `PointOfCare` and `ServiceType` so they are mapped at the record level. By the end of `for (i in seq_along(settings_blocks)) {...}` I have a list of records/listings. I then do some tidying of the records and return tibbles, so I can later use `map_dfr` to generate my final dataframe structure:
records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)
With the final dataframe structure in place and populated, I set about tidying up some other things I noticed:
① During my `final_blocks` function, the encoding of UTF-8 input strings was getting garbled.
For example, the following correctly UTF-8 encoded string (on Windows OS):
Autisme-Asperger-Québec (AAQc)
Ended up as:
Autisme-Asperger-QuÃ©bec (AAQc)
A colleague pointed out that it was actually `tidy_html()` at fault, and that this was particular to Windows (the code ran fine on Linux) due to the default encoding on Windows. The mangling is called Mojibake. He pointed me to a couple of links for further reading; to quote only a small part of the latter:
> The reason lies in the UTF-8 representation. Characters below or equal to 127 (0x7F) are represented with 1 byte only, and this is equivalent to the ASCII value. Characters below or equal to 2047 are written on two bytes of the form 110yyyyy 10xxxxxx, where the scalar representation of the character is: 00000yyyyyxxxxxx
>
> “é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary representation is: 00000000 11101001. “é” (233) is therefore between 127 and 2047, so it will be coded on 2 bytes. Therefore its UTF-8 representation is 11000011 10101001.
>
> Now let’s imagine that this “é” sits in a document that’s believed to be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters are coded on 8 bits, so the 2-byte character “é” will become 2 1-byte-long latin-1 characters. The first character is 11000011, i.e. C3, which, when checking the table, corresponds to “Ã” (U+00C3); the second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).
The colleague pointed out I could fix this by converting from UTF-8 to latin1 twice, because the UTF-8 characters had been encoded in UTF-8 again:
iconv(iconv(<mangled_string>, from = "UTF-8", to = "latin1"), "UTF-8", "latin1")
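To see the round trip in action, here is a self-contained sketch; the first `iconv()` call merely simulates the mangling that `tidy_html()` produced:

s <- "Autisme-Asperger-Québec (AAQc)"               # correctly encoded UTF-8
mangled <- iconv(s, from = "latin1", to = "UTF-8")  # simulate: UTF-8 bytes re-encoded as if they were latin1
mangled                                             # "Autisme-Asperger-QuÃ©bec (AAQc)"
iconv(iconv(mangled, from = "UTF-8", to = "latin1"), "UTF-8", "latin1") # "Autisme-Asperger-Québec (AAQc)"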
I had introduced `tidy_html` to ensure sliced text ended up being parsable.
② I chose not to try to fix the mangled strings as per the description above. Instead, as my final dataframe provided the skeleton for where all my data resided, I simply went back to the original HTML document, parsed out the info again (in UTF-8), and mapped it onto my dataframe. This had the added benefit of preserving spacing between certain words, as well as line breaks.
titles <- page %>%
html_nodes(".emhTip a:nth-of-type(1)") %>%
html_text()
descriptions <- page %>%
html_nodes(".emhTip + .summaryRecordType") %>%
html_text() %>%
trimws()
mixed_nodes <- page %>%
html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
html_text() %>%
trimws()
r <- r1 <- 0
# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
if (!is.na(listings$Title[i])) {
r <- r + 1
listings$Title[i] <- titles[r]
if (!is.na(listings$BusinessName[i])) {
listings$BusinessName[i] <- mixed_nodes[r]
}
}
if (!is.na(listings$ServiceDescription[i])) {
r1 <- r1 + 1
listings$ServiceDescription[i] <- descriptions[r1]
}
}
Last, but not least, I noticed that some service descriptions had a `...more` in the listing, where an additional XHR request would be required to gather the full description. In case you want to obtain the full descriptions, I provide a helper function to retrieve them:
expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()
listings$ServiceDescription <- expanded_descriptions
Now, that did slow the run-time, as I needed to add some delays to ensure connections were opened and closed properly.
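If you prefer, purrr's built-in rate limiting could stand in for the `CatchupPause()` helper used below; a sketch, untested against this site (you would drop the internal `CatchupPause()` call from `full_description`):

slow_description <- slowly(full_description, rate = rate_delay(0.5)) # pause 0.5s between calls
listings$ServiceDescription <- map2_chr(listings$ServiceDescription, listings$Url, slow_description)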
The full code is below, including a couple of attributions where I borrowed a few lines from other SO contributors.
R:
library(stringr)
library(rvest)
library(htmltidy)
library(tidyverse)
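# extract the tidied PointOfCare label (e.g. "1. Hospitals") from a heading node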
point_of_delivery <- function(node) {
pod <- node %>%
html_node(".classyHeading") %>%
html_text() %>%
str_split("\n") %>%
unlist() %>%
tail(1) %>%
trimws() %>%
str_replace("\xa0", " ")
return(pod)
}
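# the same heading text, untrimmed, used as a literal search key into the raw page source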
delivery_matches <- function(node) {
dm <- node %>%
html_node(".classyHeading") %>%
html_text() %>%
str_split("\n") %>%
unlist() %>%
tail(1)
return(dm)
}
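# slice the full page source into one tidied HTML chunk per PointOfCare heading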
get_blocks <- function(a_list) {
results <- vector("list", length(a_list))
for (i in seq_along(a_list)) {
start_pos <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i])))[, 1]
if (i == length(a_list)) {
block <- substring(all_text, start_pos, nchar(all_text)) %>% tidy_html()
} else {
next_start <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i + 1])))[, 1]
block <- substring(all_text, start_pos, next_start) %>% tidy_html()
}
results[[i]] <- block
}
return(results)
}
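# start offsets of each ServiceType label within a block; NA_integer_ if neither label is present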
split_points <- function(node) {
res <- map(service_types, ~ str_locate_all(node %>% toString(), .)) %>% unlist()
if (length(res) == 0) {
return(c(NA_integer_))
} else {
return(res[seq(1, length(res), 2)]) # https://stackoverflow.com/a/34100009/6241235 @stas g
}
}
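# split a PointOfCare block into one tidied sub-block per ServiceType found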
final_blocks <- function(splits, block) {
results <- vector("list", length(splits))
if (length(splits) == 1) {
res <- ifelse(is.na(splits), splits, block %>% tidy_html())
} else {
for (i in seq_along(splits)) {
start_pos <- splits[i]
if (i == length(splits)) {
res <- substring(block, start_pos, nchar(block)) %>% tidy_html()
} else {
next_start <- splits[i + 1]
res <- substring(block, start_pos, next_start) %>% tidy_html()
}
results[i] <- res
}
return(results)
}
}
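# build a single listing record, applying the summaryRecordType count-to-field mapping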
record_from_table <- function(table, point_of_care, service_type) {
info_lines <- table %>% html_nodes(".summaryRecordType")
Title <- table %>%
html_node("a") %>%
html_text() %>%
trimws()
Url <- table %>%
html_node("a") %>%
html_attr("href") %>%
url_absolute(link)
if (length(info_lines) == 3) {
BusinessName <- NA_character_
ServiceDescription <- NA_character_
Address <- info_lines[1] %>%
html_text() %>%
trimws()
Tel <- info_lines[2] %>%
html_text() %>%
trimws()
Website <- NA_character_
AreaServed <- info_lines[3] %>%
html_text() %>%
trimws()
} else if (length(info_lines) == 4) {
BusinessName <- info_lines[1] %>%
html_text() %>%
trimws()
ServiceDescription <- NA_character_
Address <- info_lines[2] %>%
html_text() %>%
trimws()
Tel <- info_lines[3] %>%
html_text() %>%
trimws()
Website <- NA_character_
AreaServed <- info_lines[4] %>%
html_text() %>%
trimws()
} else {
BusinessName <- NA_character_
ServiceDescription <- info_lines[1] %>%
html_text() %>%
trimws()
Address <- info_lines[2] %>%
html_text() %>%
trimws()
Tel <- info_lines[3] %>%
html_text() %>%
trimws()
Website <- info_lines[4] %>%
html_text() %>%
trimws()
AreaServed <- info_lines[5] %>%
html_text() %>%
trimws()
}
Ages <- get_age(table)
return(list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages))
}
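# ages served for a listing, with the "Add to Info Cart" widget text stripped (NA on failure)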
get_age <- function(table) {
tryCatch(table %>% html_node(".summaryTitlePrivatePractice + td") %>%
html_text() %>% str_replace("Add to Info Cart", "") %>% trimws(), error = function(e) {
return(NA_character_)
})
}
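# strip residual boilerplate (" Map", "Area(s) Served: "), name the fields and return a one-row tibble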
clean_record <- function(a_record) {
a_record[[7]] <- str_replace(a_record[[7]], " Map", "")
a_record[[10]] <- str_replace(a_record[[10]], "Area[s]? Served: ", "")
a_record <- set_names(a_record, c("PointOfCare", "ServiceType", "Title", "Url", "BusinessName", "ServiceDescription", "Address", "Tel", "Website", "AreaServed", "Ages"))
return(a_record %>% as_tibble())
}
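# fetch the full description from the listing's own page when the summary is truncated (" ...")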
full_description <- function(current_description, current_url) {
if (grepl(" \\.\\.\\.", current_description)) {
content <- read_html(current_url, encoding = "UTF-8") %>%
html_node(".recordSummary") %>%
html_text() %>%
trimws()
CatchupPause(.1)
} else {
content <- gsub("\\s+more", "", current_description) %>% trimws()
}
return(content)
}
CatchupPause <- function(Secs) { # https://stackoverflow.com/a/52758758 @nm200
Sys.sleep(Secs) # pause to let connection work
closeAllConnections()
gc()
}
link <- "https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229&recordType=1&sortBy=0"
page <- read_html(link, encoding = "UTF-8")
all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)
service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses") # annoying have to hardcode as \r\n not present in node output
records <- vector("list", 1000) # > max expected num entries when lists unnested
r <- 0
# Generate all records for the final tibble
for (i in seq_along(settings_blocks)) {
r <- r + 1
point_of_care <- points_of_delivery[[i]]
splits <- split_points(settings_blocks[[i]])
nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))
if (is.na(nodes_html)[1]) {
record <- list(
point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
)
records[[r]] <- list(record)
} else {
for (j in seq_along(nodes_html)) {
service_type <- if_else(str_detect(nodes_html[[j]], service_types[1]), service_types[1], service_types[2])
tables <- nodes_html[[j]] %>%
read_html() %>%
html_nodes(".condensedViewTable")
records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))
r <- r + 1
}
}
}
records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)
#
## Partly due to the default Windows encoding, and lack of native UTF-8 support in R on Windows, causing Mojibake via
## the earlier tidy_html(), we grab the properly encoded info to overwrite the mangled text / text lacking spaces
titles <- page %>%
html_nodes(".emhTip a:nth-of-type(1)") %>%
html_text()
descriptions <- page %>%
html_nodes(".emhTip + .summaryRecordType") %>%
html_text() %>%
trimws()
mixed_nodes <- page %>%
html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
html_text() %>%
trimws()
r <- r1 <- 0
# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
if (!is.na(listings$Title[i])) {
r <- r + 1
listings$Title[i] <- titles[r]
if (!is.na(listings$BusinessName[i])) {
listings$BusinessName[i] <- mixed_nodes[r]
}
}
if (!is.na(listings$ServiceDescription[i])) {
r1 <- r1 + 1
listings$ServiceDescription[i] <- descriptions[r1]
}
}
# descriptions_to_expand <- dplyr::filter(listings, grepl(" \\.\\.\\.", ServiceDescription))
expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()
listings$ServiceDescription <- expanded_descriptions
write.csv(listings, "~/data.csv", na = "")
Some example rows of output:

*(screenshot of example output rows)*