Consider this simple example:

example_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<file>
<book>
<text>abracadabra</text>
<node></node>
</book>
<book>
<text>hello world</text>
<node></node>
</book>
</file>'

library(xml2)
library(magrittr)  # provides %>%

myxml <- read_xml(example_xml)

Now, running this works as expected:

> myxml %>% xml_find_all('//book')
{xml_nodeset (2)}
[1] <book>\n  <text>abracadabra</text>\n  <node/>\n</book>
[2] <book>\n  <text>hello world</text>\n  <node/>\n</book>

But looking for nodes whose text attribute contains wor does not:

> myxml %>% xml_find_all('//book[contains(@text, "wor")]')
{xml_nodeset (0)}

What is the problem here? How can I use regex (or partial string matching) with xml2?

Thanks!

Wiktor Stribiżew
ℕʘʘḆḽḘ

1 Answer

The XPath //book[contains(@text, "wor")] matches book elements that have a text attribute (@ selects an attribute) whose value contains wor.

Your XML does not contain elements like <book text="Hello world">Title</book>, thus there are no results.
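
To illustrate, here is a minimal sketch with a hypothetical variant of the document in which each book does carry a text attribute (attr_xml and hits are illustrative names, not part of the question's data). On such input the original XPath should match:

```r
library(xml2)

# Hypothetical input: the text lives in an attribute instead of a child element
attr_xml <- read_xml('<file>
  <book text="abracadabra"/>
  <book text="hello world"/>
</file>')

# @text now refers to something that actually exists on each book
hits <- xml_find_all(attr_xml, '//book[contains(@text, "wor")]')
xml_attr(hits, "text")
# "hello world"
```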

To get the book nodes whose string value (the concatenated text of all their descendants) contains wor, use

> xml_find_all(myxml, '//book[contains(., "wor")]')
{xml_nodeset (1)}
[1] <book>\n  <text>hello world</text>\n  <node/>\n</book>

If you only need the matching text elements themselves as the return values, you may use

> xml_find_all(myxml, '//book/text[contains(., "wor")]')
{xml_nodeset (1)}
[1] <text>hello world</text>

If you need every book that is the direct parent of any element containing wor, use

> xml_find_all(myxml, '//*[contains(., "wor")]/parent::book')
{xml_nodeset (1)}
[1] <book>\n  <text>hello world</text>\n  <node/>\n</book>
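
As for regex: xml2 is built on libxml2, which implements XPath 1.0, and XPath 1.0 has no regular-expression function (matches() only arrived in XPath 2.0). One possible workaround, sketched here under the assumption that filtering on the R side is acceptable, is to select the candidate nodes with XPath and then apply grepl() to their text. The names books and matched and the pattern are illustrative only:

```r
library(xml2)

myxml <- read_xml('<file>
  <book><text>abracadabra</text><node/></book>
  <book><text>hello world</text><node/></book>
</file>')

# Select all candidates with XPath, then filter with a regular expression in R
books   <- xml_find_all(myxml, "//book")
matched <- books[grepl("wor[a-z]+", xml_text(books))]

xml_text(xml_find_first(matched, "./text"))
# "hello world"
```

This trades a single XPath expression for two steps, but it makes the full R regex syntax available.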

See this answer to learn more about the difference between text() and . (the context node). In short, [contains(., "wor")] returns true if the string value of an element, i.e. the concatenated text of the element and all of its descendants, contains wor.
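
A small sketch of that difference, using a hypothetical document with mixed content (nested is an illustrative name, not taken from the question). In XPath 1.0, passing text() to contains() converts the node set to the string of its first text node only, so the two predicates can disagree:

```r
library(xml2)

# Hypothetical input: "wor" sits inside a nested <b> element
nested <- read_xml('<file><book><text>hello <b>wor</b>ld</text></book></file>')

# . uses the full string value "hello world", so this matches
with_dot <- xml_find_all(nested, '//text[contains(., "wor")]')

# text() yields the direct text nodes "hello " and "ld"; only the first is tested
with_text <- xml_find_all(nested, '//text[contains(text(), "wor")]')

length(with_dot)   # 1
length(with_text)  # 0
```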

Wiktor Stribiżew