
I am new to web scraping with R and I am trying to get a daily updated object which is probably not text. The URL is https://covid19.public.lu/en.html and I want to extract the daily situation table at the end of the page. The class of this object is

class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"

I am not really experienced with HTML and CSS, so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."

– pRo

4 Answers


Without getting into the business of writing web scrapers, I think this should help you out:

library(rvest)
url <- 'https://covid19.public.lu/en.html'
source <- read_html(url)
# One node per stat box; take the headline number from each, then collapse to one string
selection <- html_nodes(source, '.cmp-gridStat__item-container') %>%
  html_node('.number') %>%
  html_text() %>%
  toString()
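
If you would rather keep one element per stat box instead of a single comma-separated string, a small variation (an untested sketch; it assumes the page still uses the same .cmp-gridStat__item-container and .number classes) is to drop the toString() step:

library(rvest)

url <- 'https://covid19.public.lu/en.html'
numbers <- read_html(url) %>%
  html_nodes('.cmp-gridStat__item-container') %>%  # one node per stat box
  html_node('.number') %>%                         # headline figure in each box
  html_text(trim = TRUE)

numbers  # character vector with one headline figure per box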
– TMo

There is probably a much more elegant way to do this efficiently, but when I need to brute force something like this, I try to break it down into small parts.

  1. Use the httr library to get the raw html.
  2. Use str_extract from the stringr library to extract the specific piece of data from the html.
  3. I use both a positive lookbehind and a positive lookahead regex to get the exact piece of data I need. It basically takes the form (?<=text_right_before).+?(?=text_right_after); a short illustration on a hard-coded snippet follows the code below.
library(httr)
library(stringr)

r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")  # raw html as a single string

normal_care <- str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care <- str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
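
To see how the lookaround pattern behaves in isolation, here is a minimal sketch on a hard-coded snippet (not the live page): (?<=...) requires the given text immediately before the match, .+? matches as little as possible, and (?=...) requires the given text immediately after.

library(stringr)

snippet <- "<p>Normal care: 57<br>Intensive care: 23</p>"

# "Normal care: " must appear right before the match, "<br>" right after it
str_extract(snippet, regex("(?<=Normal care: ).+?(?=<br>)"))
#> [1] "57"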
– Joe Erinjeri
  • Thank you for your answer, but can you elaborate a bit on the regex? How can I identify what I need to express from this: "Normal care: 57 Intensive care: 23 \n"? – pRo Dec 12 '21 at 08:59

We can convert the text obtained from the Daily situation update into a tibble using the vroom package:

library(rvest)
library(vroom)

url = 'https://covid19.public.lu/en.html'
df = url %>%
  read_html() %>% 
  html_nodes('.cmp-gridStat__item-container') %>% 
  html_text2()

vroom(df, delim = '\\n', col_names = F)

# A tibble: 22 x 1
   X1                                     
   <chr>                                  
 1 369 People tested positive for COVID-19
 2 Per 100.000 inhabitants: 58,13         
 3 Unvaccinated: 91,20  
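
If you then want the label and the value in separate columns, one option (a sketch building on the tibble printed above; the column name X1 is as shown in that output) is tidyr::separate:

library(tidyr)

vroom(df, delim = '\\n', col_names = F) %>%
  separate(X1, into = c('metric', 'value'), sep = ': ', fill = 'right')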

Edit:

html_element vs html_elements

The output of html_elements (html_nodes) is:

[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"                                       
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"                                                                     
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"                                                                                  
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884" 

and that of html_element (html_node) is

[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "

As you can see, html_nodes returns all values associated with the matching nodes, whereas html_node only returns the first node. Thus, the former fetches you all the nodes, which is really helpful.

html_text vs html_text2

html_text2 retains the breaks in strings, usually \n, while html_text does not. These are helpful when working with strings.
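
A tiny self-contained illustration of both differences, using an inline snippet rather than the live page:

library(rvest)

doc <- read_html("<div><p>Normal care: 57<br>Intensive care: 23</p><p>New deaths: 1</p></div>")

doc %>% html_element("p")  %>% html_text2()  # first <p> only, <br> kept as "\n"
doc %>% html_elements("p") %>% html_text2()  # all <p> nodes
doc %>% html_elements("p") %>% html_text()   # same nodes, but the <br> break is lost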

More info is in the rvest documentation: https://cran.r-project.org/web/packages/rvest/rvest.pdf

– Nad Pat
  • Thank you a lot, I think I am going to choose your solution! Could you please tell me what is the difference between the html_nodes and html_node functions, and what is the difference between html_text2 and html_text? – pRo Dec 12 '21 at 10:32
  • Check the modified answer. – Nad Pat Dec 12 '21 at 12:58

I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info), you can use the API to extract it.

If you want it as a data frame (resembling the one on the webpage), you can write a user-defined function, with the help of pdftools, to reconstruct the table from the pdf. It is a bit more effort, but as you already have other answers covering rvest, I thought I'd have a look at this. I looked at tabulizer but that wasn't particularly effective.

More than likely, you could pull several of the API datasets together to get the full content without needing to parse the pdf publication I use; e.g. there is an Excel spreadsheet that gives the case numbers.

N.B. A few calculations from the bottom of the webpage are not included below. I have only processed the testing info table from the pdf.


Rapports journaliers:

https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_ https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf

API datasets:

https://data.public.lu/api/1/datasets/#


library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)

r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier

filename <- "covd_daily.pdf"

download.file(latest_daily_covid_pdf, filename, mode = "wb")

get_latest_daily_df <- function(filename) {

  # Read the pdf; each page comes back as one character string
  data <- pdf_text(filename)

  # Split the first page on blank lines to separate the report's blocks
  text <- data[[1]] %>% strsplit(split = "\n{2,}")

  # Blocks 3 to 12 hold the rows of the testing info table
  web_data <- text[[1]][3:12]

  # Split each row on runs of whitespace and rebuild as a 10 x 5 tibble
  df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
    unlist() %>%
    matrix(nrow = 10, ncol = 5, byrow = T) %>%
    as_tibble()

  # Column headers come from block 2; strip the digits glued onto the labels
  colnames(df) <- text[[1]][2] %>%
    strsplit(split = "\\s{2,}") %>%
    map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
    unlist()

  # The last line of block 1 provides the label pasted onto the daily columns
  title <- text[[1]][1] %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    tail(1) %>%
    gsub("\\s+", " ", .) %>%
    gsub(" TOTAL", "", .)

  colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
  colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
  colnames(df)[1] <- "Metric"

  # Remove whitespace and commas, then convert to numeric
  clean_col <- function(x) {
    gsub("\\s+|,", "", x) %>% as.numeric()
  }

  # Tidy the metric labels: drop trailing footnote digits and replace newlines with spaces
  clean_col2 <- function(x) {
    gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
  }

  df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
    Metric = clean_col2(Metric)
  )

  return(df)
}


View(get_latest_daily_df(filename))

Output:

[screenshot of the resulting data frame]


Alternate:

If you simply want to pull the items and then process them, you could extract each column as an item in a list. Replace the br elements such that the content within them ends up in a comma-separated list:

library(rvest)
library(magrittr)
library(stringi)
library(xml2)

page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") #This method from https://stackoverflow.com/a/46755666 @hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()

columns <- page %>% html_elements(".cmp-gridStat__item")

map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty())
– QHarr
  • This is great! I would really appreciate it if you have the time to elaborate a bit on the use of the map function in both scripts. Thank you for your answer! – pRo Dec 12 '21 at 10:24
    `map` simply applies the user defined function to a list. In this case the list is the 5 _"columns"_ – QHarr Dec 12 '21 at 10:44