How to scrape headers as a different column from paragraphs with rvest, assuming they have different lengths?

Question

I want to scrape the following url: "https://www.constituteproject.org/constitution/Cuba_2018D?lang=es"

This is a constitution in which each section of the document has a title at the CSS selector ".float-left" and its content in "p". I want to put the section title in a column so that it identifies each paragraph. However, the two selectors return results of different lengths.

I've tried the following so far:

pacman::p_load(tidyverse, rvest)


url <- "https://www.constituteproject.org/constitution/Cuba_2018D?lang=es"
content <-  tryCatch(
  url %>%
    as.character() %>% 
    read_html() %>% 
    html_nodes('p') %>% 
    html_text())


titulo <-  tryCatch(
  url %>%
    as.character() %>% 
    read_html() %>% 
    html_nodes('.float-left') %>% 
    html_text())


final <- bind_cols(titulo, content)

Maybe this question and answer will help https://stackoverflow.com/questions/56673908/how-do-you-scrape-items-together-so-you-dont-lose-the-index — Dave2e, Aug 28 '21 at 00:19

score 1 · Accepted Answer · answered Aug 28 '21 at 00:40

One option is to extract the title and the content together as a data frame, e.g. with map_dfr. To do so, I first select the nodes that contain both the title and the content using the CSS selector section .article-list .level2. To deal with the different lengths, you can put the content, which may contain multiple paragraphs, inside a list column and unnest it afterwards. Additionally, because the same CSS selector also matches the section headings, I added a filter that drops them and keeps only the ARTÍCULO entries.

library(rvest)
library(tidyverse)

url <- "https://www.constituteproject.org/constitution/Cuba_2018D?lang=es"

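# Parse the page once
html <- read_html(url)

# Each matched node holds one title together with its paragraphs
foo <- html %>%
  html_nodes('section .article-list .level2')

# One row per node; the paragraphs go into a list column, so lengths need not match
final <- map_dfr(foo, ~ tibble(
  titulo = html_nodes(.x, '.float-left') %>% html_text(),
  content = list(html_nodes(.x, "p") %>% html_text()))) %>%
  # Drop the section headings, which the selector also matches
  filter(!grepl("^SEC", titulo)) %>%
  # Expand the list column to one row per paragraph
  unnest_longer(content)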

final
#> # A tibble: 2,145 × 2
#>    titulo     content                                                           
#>    <chr>      <chr>                                                             
#>  1 ARTÍCULO 1 Cuba es un Estado socialista de derecho, democrático, independien…
#>  2 ARTÍCULO 2 El nombre del Estado cubano es República de Cuba, el idioma ofici…
#>  3 ARTÍCULO 3 La defensa de la patria socialista es el más grande honor y el de…
#>  4 ARTÍCULO 3 El socialismo y el sistema político y social revolucionario, esta…
#>  5 ARTÍCULO 3 Los ciudadanos tienen el derecho de combatir por todos los medios…
#>  6 ARTÍCULO 4 Los símbolos nacionales son la bandera de la estrella solitaria, …
#>  7 ARTÍCULO 4 La ley define los atributos que los identifican, sus característi…
#>  8 ARTÍCULO 5 El Partido Comunista de Cuba, único, martiano, fidelista y marxis…
#>  9 ARTÍCULO 6 La Unión de Jóvenes Comunistas, organización de  la juventud cuba…
#> 10 ARTÍCULO 7 La Constitución es la norma suprema del Estado. Todos están oblig…
#> # … with 2,135 more rows
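
To try the list-column idea without hitting the website, here is a minimal offline sketch. The HTML snippet, the ARTICLE labels and the object names page, blocks and toy are made up for illustration; only the class names level2 and float-left mirror the selectors used above, and the real page structure may differ. It also uses html_elements() and html_element(), which assumes rvest 1.0.0 or later.

```r
library(rvest)
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical markup: one title per block, with a varying number of paragraphs
page <- minimal_html('
  <div class="level2"><h3 class="float-left">ARTICLE 1</h3><p>First paragraph.</p></div>
  <div class="level2"><h3 class="float-left">ARTICLE 2</h3><p>Second, part a.</p><p>Second, part b.</p></div>
')

# One node per block, so titles and paragraphs stay paired
blocks <- html_elements(page, ".level2")

toy <- tibble(
  # html_element() returns exactly one match per block
  titulo = html_text(html_element(blocks, ".float-left")),
  # All paragraphs of a block are kept together in a list column
  content = map(blocks, ~ html_text(html_elements(.x, "p")))
) %>%
  # One row per paragraph; the title is repeated as needed
  unnest_longer(content)

toy
```

Here toy should have three rows, with ARTICLE 2 repeated for its two paragraphs. Note that a block without any paragraph would be dropped by unnest_longer() unless keep_empty = TRUE is set.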
