So I am attempting to webscrape a webpage that has irregular blocks of data that is organized in a manner easy to spot with the eye. Let's imagine we are looking at wikipedia. If I am scraping the text from articles of the following link I end up with 33 entries. If I instead grab just the headers, I end up with only 7 (see code below). This result does not surprise us as we know that some sections of articles have multiple paragraphs while others have only one or no paragraph text.
My question though is, how do I associate my headers with my texts. If there were the same number of paragraphs per header or some multiple, this would be trivial.
library(rvest)
wiki <- html("https://en.wikipedia.org/wiki/Web_scraping")
wikitext <- wiki %>%
html_nodes('p+ ul li , p') %>%
html_text(trim=TRUE)
wikiheading <- wiki %>%
html_nodes('.mw-headline') %>%
html_text(trim=TRUE)