My objective is to use the tm toolkit (library(tm)) on a fairly large Word document. The document has sensible typography, so there are h1 headings for the main sections and some h2 and h3 subheadings. I want to compare and text-mine each section, that is, the text below each h1. The subheadings are of little importance, so they can be included or excluded.
My strategy is to export the Word document to HTML and then use the rvest package to extract the paragraphs.
library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file (read_html() replaces the deprecated html())
file <- rvest::read_html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')
# attempt: select <p> elements that are direct children of an <h1>
nodes <- file %>%
  rvest::html_nodes("h1>p") %>%
  rvest::html_text()
I can extract all the <p> elements with html_nodes("p"), but that's just one big soup. I need to analyze each h1 separately.
The best result would probably be a list with a vector of p tags for each h1 heading. Maybe a loop with something like for (i in 1:length(html_nodes(file, "h1"))) (html_children(html_nodes(file, "h1")[i])) (which is not working).
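To make the target structure concrete, here is a minimal sketch of the result I am after. It is not a tested solution for the real file. The inline HTML is a made-up stand-in for the Word export, and the sketch assumes that the p elements are siblings of the h1 headings (not nested inside them) and that the installed rvest re-exports read_html(). Under those assumptions, an XPath count of preceding h1 siblings might be one way to group the paragraphs:

```r
library(rvest)

# Hypothetical stand-in for the exported Word HTML (not the real file)
html <- '<html><body>
<h1>First</h1><p>a</p><h2>Sub</h2><p>b</p>
<h1>Second</h1><p>c</p>
</body></html>'
doc <- read_html(html)

headings <- html_nodes(doc, "h1")

# For the i-th h1, keep every <p> that has exactly i <h1> siblings before it,
# i.e. the paragraphs between this heading and the next one
sections <- lapply(seq_along(headings), function(i) {
  xpath <- sprintf("//p[count(preceding-sibling::h1) = %d]", i)
  html_text(html_nodes(doc, xpath = xpath))
})
names(sections) <- html_text(headings)
```

With this toy input, sections is a named list of length two: "First" holds c("a", "b") and "Second" holds "c". Paragraphs under h2 and h3 subheadings end up in the enclosing h1 section, which is fine for my purpose.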
Bonus if there is a way to tidy Word's HTML from within rvest.