My objective is to use the library(tm) toolkit on a fairly large Word document. The document has sensible typography: h1 for the main sections, plus some h2 and h3 subheadings. I want to compare and text mine each section, i.e. the text below each h1. The subheadings are of little importance, so they can be included or excluded.

My strategy is to export the Word document to HTML and then use the rvest package to extract the paragraphs.

library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file
# (read_html() replaces the older html(), which newer rvest versions no longer provide)
file <- rvest::read_html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')

nodes <- file %>%
  rvest::html_nodes("h1>p") %>%
  rvest::html_text()

I can extract all the <p> elements with html_nodes("p"), but that's just one big soup. I need to analyze each h1 section separately.

The best result would probably be a list with a vector of p tags for each h1 heading, perhaps built with a loop, something like for (i in 1:length(html_nodes(file, "h1"))) (html_children(html_nodes(file, "h1")[i])) (which is not working).

Bonus points if there is a way to tidy Word's HTML from within rvest.

Andreas
  • You can use the [`htmltidy`](https://github.com/hrbrmstr/htmltidy) package (which wraps `libtidy`) to tidy ugly Word-generated HTML in R directly now. – hrbrmstr Sep 11 '16 at 13:19

1 Answer

Note that > is the child combinator; the selector that you currently have looks for p elements that are children of an h1, which doesn't make sense in HTML and so returns nothing.
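To make the difference concrete, here is a minimal sketch on a made-up fragment. The inline HTML is an assumption that only mimics the shape of the Word export, where p elements sit next to the h1 rather than inside it:

```r
library(rvest)

# Hypothetical miniature of the exported document (not the asker's real file):
# the p elements are siblings of the h1, never its children.
doc <- read_html('<body>
  <div class="WordSection1"><p class="MsoTocHeading">Contents</p></div>
  <div class="WordSection2"><h1>First</h1><p>a</p><p>b</p></div>
</body>')

# Child combinator: matches only p elements directly inside an h1, so nothing
n_child <- length(html_nodes(doc, "h1 > p"))

# Plain type selector: matches every p in the document
n_all <- length(html_nodes(doc, "p"))

print(c(n_child, n_all))
```

On this fragment the first count is 0 and the second is 3, which is the "one big soup" described in the question.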

If you inspect the generated markup, at least in the example document that you've provided, you'll notice that every h1 element (as well as the heading for the table of contents, which is marked up as a p instead) has an associated parent div:

<body lang="EN-US">
  <div class="WordSection1">
    <p class="MsoTocHeading"><span lang="DA" class='c1'>Indholdsfortegnelse</span></p>
    ...
  </div><span lang="DA" class='c5'><br clear="all" class='c4'></span>

  <div class="WordSection2">
    <h1><a name="_Toc285441761"><span lang="DA">Interview med Jakob skoleleder på
    a_skolen</span></a></h1>
    ...
  </div><span lang="DA" class='c5'><br clear="all" class='c4'></span>

  <div class="WordSection3">
    <h1><a name="_Toc285441762"><span lang="DA">Interviewet med Andreas skoleleder på
    b_skolen</span></a></h1>
    ...
  </div>
</body>

All of the p elements in each section denoted by an h1 are found in its respective parent div. With this in mind, you could simply select p elements that are siblings of each h1. However, since rvest doesn't currently have a way to select siblings from a context node (html_nodes() only supports looking at a node's subtree, i.e. its descendants), you will need to do this another way.

Assuming HTML Tidy creates a structure where every h1 is in a div that is directly within body, you can grab every div except the table of contents using the following selector:

sections <- html_nodes(file, "body > div ~ div")

In your example document, this should result in div.WordSection2 and div.WordSection3. The table of contents is represented by div.WordSection1, and that is excluded from the selection.
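As a quick check of what that selector picks up, here is a small sketch on an inline fragment. The HTML below is a hypothetical reduction of the example document, and the XPath alternative is my own addition rather than part of the approach above:

```r
library(rvest)

# Hypothetical reduction of the example document: a table-of-contents div
# followed by one div per h1 section.
doc <- read_html('<body>
  <div class="WordSection1"><p class="MsoTocHeading">Contents</p></div>
  <div class="WordSection2"><h1>First interview</h1><p>a</p></div>
  <div class="WordSection3"><h1>Second interview</h1><p>b</p></div>
</body>')

# Every div directly inside body that is preceded by another sibling div,
# i.e. everything except the first div
sections <- html_nodes(doc, "body > div ~ div")
section_classes <- html_attr(sections, "class")
print(section_classes)

# Alternative (an assumption, not from the original answer): keep only the
# divs that directly contain an h1, which does not depend on the table of
# contents being the first div.
sections_xpath <- html_nodes(doc, xpath = "//body/div[h1]")
```

Note that "body > div ~ div" always skips the first div, so a document without a table of contents would lose its first section; the XPath variant avoids that, provided every section div has an h1 as a direct child.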

Then extract the paragraphs from each div:

for (section in sections) {
  paras <- html_nodes(section, "p")
  # Do stuff with paragraphs in each section...

  print(length(paras))
}
# [1] 9
# [1] 8

As you can see, length(paras) corresponds to the number of p elements in each div. Note that some of them contain nothing but an &nbsp;, which may be troublesome depending on your needs. I'll leave dealing with those outliers as an exercise for the reader.
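If a list with one vector of paragraphs per h1 is what you're after, the loop can be turned into something like the sketch below. The inline HTML and the name paras_by_section are made up for illustration, and dropping the &nbsp;-only paragraphs is just one possible way of handling them:

```r
library(rvest)

# Hypothetical stand-in for the tidied Word export
doc <- read_html('<body>
  <div class="WordSection1"><p class="MsoTocHeading">Contents</p></div>
  <div class="WordSection2"><h1>First interview</h1>
    <p>Question one</p><p>&nbsp;</p><p>Answer one</p></div>
  <div class="WordSection3"><h1>Second interview</h1>
    <p>Question two</p></div>
</body>')

sections <- html_nodes(doc, "body > div ~ div")

# One character vector of paragraph text per section
paras_by_section <- lapply(sections, function(section) {
  txt <- html_text(html_nodes(section, "p"))
  # Turn non-breaking spaces (U+00A0) into ordinary spaces, trim,
  # and drop paragraphs that end up empty
  txt <- trimws(gsub("\u00a0", " ", txt, fixed = TRUE))
  txt[nzchar(txt)]
})

# Name each list element after the text of its section's first h1
names(paras_by_section) <- vapply(
  sections,
  function(section) html_text(html_nodes(section, "h1"))[1],
  character(1)
)

str(paras_by_section)
```

Each element is a plain character vector, so it should be possible to hand it to tm per section, for example via VCorpus(VectorSource(...)).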

Unfortunately, no bonus points for me as rvest does not provide its own HTML Tidy functionality. You will need to process your Word documents separately.

BoltClock