I am trying to create a data frame in R from an HTML file. The HTML file contains several articles from a website; each article consists of a heading followed by paragraphs. I would like to store the headings in one vector and the full text of each article in another.

I use rvest with html_nodes and CSS selectors, which works quite well so far. But I can't create a data frame, because the number of headings and the number of paragraphs are not the same: an article consists of several paragraphs, and that number differs from article to article.

How do I write code that tells R to combine all paragraphs of an article into a single element of the vector?

This is the code I have so far:

library(rvest)

site <- read_html("Local Path")

heading <- html_text(html_nodes(x=site, ".counted"))
heading <- gsub('\"', "", heading, fixed = TRUE)
heading  

fulltext <- html_text(html_nodes(x=site, ".dearticleParagraph"))
fulltext <- gsub("\r\n", "", fulltext, fixed = TRUE)
head(fulltext)

dataframe <- data.frame(Heading = heading, FullText = fulltext, stringsAsFactors = FALSE)

You can find an example HTML file here: https://seafile.zfn.uni-bremen.de/f/2648cd4c7a7a429a9c7d/?dl=1

Thanks a lot.

Best regards, Echoes

Echoes
  • It would be a lot easier to answer your question if you would share more of your code and a bit of data to test it: https://stackoverflow.com/a/5963610/5028841 – JBGruber Sep 10 '18 at 16:55
  • Hi, thanks for your comment. I updated the code above. The HTML file is not online; it is only saved on my local hard drive. I uploaded an example: https://www.file-upload.net/download-13310882/output_1-100.html.html – Echoes Sep 11 '18 at 07:14
  • I couldn't download it. The link gave me an .exe file that looked like a trojan to me. Maybe you can save it as a .txt file and upload it somewhere else. – maydin Sep 11 '18 at 07:30
  • Sorry. Should be on the filehost. Next try: HTML: https://seafile.zfn.uni-bremen.de/f/2648cd4c7a7a429a9c7d/?dl=1 TXT: https://seafile.zfn.uni-bremen.de/f/73baba2439fb4456ab75/?dl=1 – Echoes Sep 11 '18 at 07:56
  • I found this code: https://koheiw.net/?p=11 It partly works, but is still not perfect. My knowledge of R is not good enough to edit the code. Does anyone have a solution? – Echoes Sep 17 '18 at 08:42
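
The grouping asked about in the question could be sketched roughly as follows. This is only a sketch under assumptions: it uses a small made-up HTML string instead of the real file, and it assumes that every heading (.counted) appears in the document before the paragraphs (.dearticleParagraph) of its article. The names para_id and articles are illustrative, not from the original code.

```r
library(rvest)

# Hypothetical miniature of the page (assumption: each heading is followed
# by the paragraphs that belong to it, in document order).
html <- '<html><body>
  <div class="counted">First heading</div>
  <p class="dearticleParagraph">Paragraph 1a.</p>
  <p class="dearticleParagraph">Paragraph 1b.</p>
  <div class="counted">Second heading</div>
  <p class="dearticleParagraph">Paragraph 2a.</p>
</body></html>'
site <- read_html(html)

# Select headings and paragraphs together so they keep their document order.
nodes <- html_nodes(site, ".counted, .dearticleParagraph")
text <- html_text(nodes)

# Flag the nodes whose class list contains "counted" (the headings).
classes <- strsplit(html_attr(nodes, "class"), "\\s+")
is_heading <- vapply(classes, function(x) "counted" %in% x, logical(1))

# Every heading starts a new article; paragraphs inherit the running count.
article_id <- cumsum(is_heading)

# Keep one factor level per heading, so an article without paragraphs
# still gets an (empty) entry and the lengths match.
para_id <- factor(article_id[!is_heading], levels = seq_len(sum(is_heading)))

# Paste all paragraphs of one article into a single string.
fulltext <- vapply(split(text[!is_heading], para_id),
                   paste, character(1), collapse = " ")

articles <- data.frame(Heading = text[is_heading],
                       FullText = unname(fulltext),
                       stringsAsFactors = FALSE)
articles
```

With the real file, read_html("Local Path") would replace the inline string. If the articles are instead wrapped in their own container elements, looping over those containers would probably be the more robust choice.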

0 Answers