I am trying to create a data frame in R from an HTML file. The file contains several articles from a website, each consisting of a heading followed by several paragraphs. I would like to store the headings in one vector and the full text of each article in another.
I use rvest with html_nodes and CSS selectors, which works quite well so far. However, I cannot create a data frame because the number of headings and the number of paragraphs differ: each article consists of several paragraphs, and that number varies from article to article.
How can I write code that tells R to combine all paragraphs of an article into a single string, so that the full-text vector has one element per article?
This is the code I have so far:
library(rvest)
site <- read_html("Local Path")
heading <- html_text(html_nodes(x=site, ".counted"))
heading <- gsub('\"', "", heading, fixed = TRUE)
heading
fulltext <- html_text(html_nodes(x=site, ".dearticleParagraph"))
fulltext <- gsub("\r\n", "", fulltext, fixed = TRUE)
head(fulltext)
# Errors when heading and fulltext differ in length, which is the problem described above
dataframe <- data.frame(Heading = heading, FullText = fulltext, stringsAsFactors = FALSE)
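What I have in mind is something like the following sketch. It assumes that each article sits in its own container element. The `.article` class and the inline HTML are made up for illustration, and my real file may well be structured differently, in which case the container selector would have to change.

```r
library(rvest)

# Made-up HTML standing in for the real file.
# The "article" wrapper class is an assumption, not taken from the actual page.
html <- '<html><body>
<div class="article"><h2 class="counted">First</h2>
<p class="dearticleParagraph">A1.</p><p class="dearticleParagraph">A2.</p></div>
<div class="article"><h2 class="counted">Second</h2>
<p class="dearticleParagraph">B1.</p></div>
</body></html>'

site <- read_html(html)

# One node per article (hypothetical container selector)
articles <- html_nodes(site, ".article")

# html_node() returns exactly one match per article, so lengths line up
heading <- html_text(html_node(articles, ".counted"))

# Collapse all paragraphs of each article into a single string
fulltext <- vapply(articles, function(a) {
  paste(html_text(html_nodes(a, ".dearticleParagraph")), collapse = " ")
}, character(1))

articles_df <- data.frame(Heading = heading, FullText = fulltext,
                          stringsAsFactors = FALSE)
```

The idea is to loop over articles rather than over all paragraphs at once, so that heading and fulltext always have one entry per article.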
You can find an example of the HTML files here: https://seafile.zfn.uni-bremen.de/f/2648cd4c7a7a429a9c7d/?dl=1
Thanks a lot.
Best regards, Echoes