I have searched for the past few days and see that XPath 2.0 has an 'except' operator, but I haven't been able to figure out how to do the equivalent with xml2.

This link is sort of what I want to do, but it is specific to XPath, and I'm trying to do a blanket exclusion of a node, like in this SO answer.

For example, my test document is a .docx, which I unzip and read. It has body text and a table. I want to read all the body text except anything inside a table. I can read both, but I can't figure out how to exclude everything under w:tbl. None of the not or except operators I tried seem to work.

With xml_find_all it scrapes anything within those nodes, without exception.

bodytext <- xml2::xml_find_all(doc, "//w:p")
tabletext <- xml2::xml_find_all(doc, "//w:tbl")
zx485
Anonymous coward
  • Although there is a "2" in the name, libxml2 does NOT implement the XPath 2.0 standard. It implements only XPath 1.0; see http://xmlsoft.org/ . If you want XPath 2.0, see this answer from 2013: https://stackoverflow.com/a/16659329/202553 – knb Feb 15 '18 at 18:17
  • Please post a relevant sample of docx xml for us to help. You can use the xpath's brackets `[...]` – Parfait Feb 15 '18 at 20:56

1 Answer

Here you are querying all existing w:p elements, but w:tbl also contains w:p elements. The following selects only the paragraphs that are direct children of the body:

xml2::xml_find_all(doc, "//w:body/w:p")
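
If some body paragraphs are not direct children of w:body, an XPath 1.0 predicate on the ancestor axis is another way to express the exclusion, since libxml2 does not support 'except'. The following is a minimal sketch on a hypothetical inline sample rather than a real document.xml; the sample content and the variable names are assumptions made for illustration.

```r
library(xml2)

# Hypothetical, heavily simplified stand-in for word/document.xml
sample_xml <- '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Body paragraph</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc>
      <w:p><w:r><w:t>Cell paragraph</w:t></w:r></w:p>
    </w:tc></w:tr></w:tbl>
  </w:body>
</w:document>'
sample_doc <- read_xml(sample_xml)

# Every paragraph, including the one inside the table cell
all_p <- xml_find_all(sample_doc, "//w:p")

# Only paragraphs that have no w:tbl ancestor (valid XPath 1.0)
non_table_p <- xml_find_all(sample_doc, "//w:p[not(ancestor::w:tbl)]")

xml_text(non_table_p)
```

Unlike //w:body/w:p, this form keeps any paragraph that sits outside a table, however deeply it is nested.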

Note that to explore the content of a docx, you can use officer::docx_summary(officer::read_docx('/path/to/document.docx')), which returns a data.frame with content, index, etc., as illustrated below.

  doc_index content_type style_name             text level num_id
1         1    paragraph       <NA>                     NA     NA
2         2    paragraph  heading 1 Table of content    NA     NA
3         3    paragraph       <NA>                     NA     NA
4         4    paragraph  heading 2     dataset iris    NA     NA
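
Because the summary has a content_type column, dropping table text can be an ordinary row filter. The sketch below builds a small throwaway document in memory instead of reading a real file, so the document content and the variable names are assumptions; it also assumes table rows are reported with a content_type other than "paragraph".

```r
library(officer)

# Build a small document in memory (hypothetical content)
demo <- read_docx()
demo <- body_add_par(demo, "Body paragraph")
demo <- body_add_table(demo, data.frame(x = c("a", "b")))

# One row per paragraph or table cell
content <- docx_summary(demo)

# Keep only rows that come from paragraphs, not from table cells
body_only <- content[content$content_type == "paragraph", ]
body_only$text
```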
David Gohel
  • I was getting an error with that: "attempt to apply non-function." However, I removed "doc$doc_obj$get()" and it works. Thank you so much, now I see how the node substructure can work. Also, officer::docx_summary is amazing. That might work even better. Thanks! – Anonymous coward Feb 15 '18 at 18:29