How to exclude a node using xml2

Question

I have searched around for the past few days and see that XPath 2.0 has an 'except' operator, but I haven't been able to figure out how to do the same thing with xml2.

This link is sort of what I want to do, but this is specific to XPath, and I'm trying to do a blanket exclusion of a node like in this SO answer.

For example, my test document is a .docx which I unzip and read. It has body text and a table. I want to read all the body text, except anything in a table. I can read both, but I can't figure out how to exclude all the w:tbl. Any not or except operators don't seem to work.

With xml_find_all it scrapes anything within those nodes, without exception.

bodytext <- xml2::xml_find_all(doc, "//w:p")
tabletext <- xml2::xml_find_all(doc, "//w:tbl")

Although there is a "2" in the name, libxml2 does NOT implement the XPath 2.0 standard. It implements only XPath 1.0, see http://xmlsoft.org/ . If you want XPath 2.0, see this answer from 2013: https://stackoverflow.com/a/16659329/202553 — knb, Feb 15 '18 at 18:17
Please post a relevant sample of docx xml for us to help. You can use the xpath's brackets `[...]` — Parfait, Feb 15 '18 at 20:56

score 1 · Accepted Answer · answered Feb 15 '18 at 18:02

Here you are querying every w:p in the document, and a w:tbl also contains w:p elements (inside its cells), so the table text is included. The following selects only the paragraphs located directly in the body:

xml2::xml_find_all(doc, "//w:body/w:p")
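To make the difference concrete, here is a minimal sketch on a hand-written stand-in for word/document.xml (the sample XML and the variable names are made up for illustration, not taken from the asker's file). It also shows an XPath 1.0 way to approximate a blanket exclusion, using a `not(ancestor::w:tbl)` predicate, which libxml2 should accept since it needs nothing from XPath 2.0.

```r
library(xml2)

# Hypothetical, stripped-down stand-in for word/document.xml
doc <- read_xml('<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Body paragraph one</w:t></w:r></w:p>
    <w:tbl>
      <w:tr><w:tc><w:p><w:r><w:t>Cell text</w:t></w:r></w:p></w:tc></w:tr>
    </w:tbl>
    <w:p><w:r><w:t>Body paragraph two</w:t></w:r></w:p>
  </w:body>
</w:document>')

# Every paragraph, including the one inside the table cell
all_p <- xml_find_all(doc, "//w:p")

# Only paragraphs that are direct children of w:body
body_p <- xml_find_all(doc, "//w:body/w:p")

# XPath 1.0 exclusion: any paragraph with no w:tbl ancestor
not_in_tbl <- xml_find_all(doc, "//w:p[not(ancestor::w:tbl)]")

length(all_p)
xml_text(body_p)
xml_text(not_in_tbl)
```

On this sample both restricted queries return the same two paragraphs. They can differ on real documents: `//w:body/w:p` only matches direct children of the body, while the `not(ancestor::w:tbl)` form also keeps paragraphs nested in other containers, as long as they are not inside a table.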

Note that to explore the content of a docx, you can use officer::docx_summary(officer::read_docx('/path/to/document.docx')), which returns a data.frame with content, index, etc., as illustrated below.

  doc_index content_type style_name             text level num_id
1         1    paragraph       <NA>                     NA     NA
2         2    paragraph  heading 1 Table of content    NA     NA
3         3    paragraph       <NA>                     NA     NA
4         4    paragraph  heading 2     dataset iris    NA     NA
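If the goal is body text without table text, the summary data.frame can probably be filtered on content_type. The sketch below builds a small throwaway document in memory instead of reading a real file. It assumes that table text is reported with content_type "table cell" and that the default template has a "Normal" paragraph style, both of which may vary between officer versions.

```r
library(officer)

# Throwaway document with one body paragraph and one small table
# (hypothetical content, built from officer's default template)
doc <- read_docx()
doc <- body_add_par(doc, "Body text", style = "Normal")
doc <- body_add_table(doc, head(iris, 2))

content <- docx_summary(doc)

# Keep paragraph rows only; table cells are assumed to be
# labelled "table cell" in the content_type column
body_only <- content[content$content_type == "paragraph", ]
body_only$text
```

The same filter should work on a document loaded with officer::read_docx('/path/to/document.docx').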

I was getting an error with that: "attempt to apply non-function." However, I removed "doc$doc_obj$get()" and it works. Thank you so much, now I see how the node substructure can work. Also, officer::docx_summary is amazing. That might work even better. Thanks! — Anonymous coward, Feb 15 '18 at 18:29
