Why does xpath find excluded nodes again?

Question

Consider this page:

<n1 class="a">
  1
</n1>
<n1 class="b">
  <b>bold</b>
  2
</n1>

If I first select the second n1 using class="b", I should be excluding the first n1, and indeed this appears true:

library(rvest)
b_nodes = read_html('<n1 class="a">1</n1>
<n1 class="b"><b>bold</b>2</n1>') %>% 
  html_nodes(xpath = '//n1[@class="b"]')
b_nodes
# {xml_nodeset (1)}
# [1] <n1 class="b"><b>bold</b>2</n1>

However, if we now query this "subsetted" page:

b_nodes %>% html_nodes(xpath = '//n1')
# {xml_nodeset (2)}
# [1] <n1 class="a">1</n1>
# [2] <n1 class="b"><b>bold</b>2</n1>

How did the 1 node get "re-discovered"??

Note: I know how to get what I want with two separate xpaths. This is a conceptual question about why the "subsetting" didn't work as expected. My understanding was that b_nodes should have excluded the first node altogether -- the b_nodes object shouldn't even know that node exists.

A simpler illustration: `b_nodes %>% html_nodes(xpath = '//n1')`. Looks like it's not designed to be chained / consistently points at the original object. — Frank, Feb 10 '17 at 19:39
@Frank good point, I thought I'd tried that and it didn't work. I'll edit it in — MichaelChirico, Feb 10 '17 at 19:40

宏杰李 · Answer 1 · 2017-02-10T20:00:06.400

2

html_nodes(xpath = '//n1')

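// is short for /descendant-or-self::node()/, so //n1 expands to /descendant-or-self::node()/n1. The leading / makes the path absolute: it starts at the root of the document that contains the current node, not at the current node itself.

Change it to .//n1: the leading . makes the path relative to the current node, i.e. whatever you selected before.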
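A minimal sketch of the difference, reusing the question's markup (the names `page`, `abs_hits`, `rel_hits`, `bold_hits` and `self_hits` are only for illustration). One caveat it shows: `.//n1` searches below the context node only, so on the class="b" node it finds nothing; if the context node itself should be eligible, something like `descendant-or-self::n1` is needed.

```r
library(rvest)

page <- read_html('<n1 class="a">1</n1>
<n1 class="b"><b>bold</b>2</n1>')
b_nodes <- html_nodes(page, xpath = '//n1[@class="b"]')

# Absolute path: starts at the document root, so both n1 elements match
abs_hits <- html_nodes(b_nodes, xpath = '//n1')
length(abs_hits)     # 2

# Relative path: searches only below the selected node, which contains no n1
rel_hits <- html_nodes(b_nodes, xpath = './/n1')
length(rel_hits)     # 0

# A relative path to a real descendant behaves as expected
bold_hits <- html_nodes(b_nodes, xpath = './/b')
html_text(bold_hits) # "bold"

# Let the context node itself match as well
self_hits <- html_nodes(b_nodes, xpath = 'descendant-or-self::n1')
length(self_hits)    # 1
```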

edited Feb 10 '17 at 20:00

answered Feb 10 '17 at 19:44

宏杰李


@MichaelChirico when you use `//` you are querying the whole document, no matter which sub-node you start from. – 宏杰李 Feb 10 '17 at 19:59
I still find it strange that the other nodes weren't deleted by the first expression. i.e., the "whole document", to me, should only be `...`, which is what displays from `b_nodes`. – MichaelChirico Feb 10 '17 at 21:08
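On the conceptual point: selecting nodes does not copy or delete anything. A node set in xml2 (which rvest builds on) holds references into the original document, and printing it shows only each node's own subtree. A rough sketch of that, assuming the question's markup; the names `b_node`, `all_n1` and `isolated` are made up for illustration, and re-parsing the serialized node is just one possible way to get a truly separate document:

```r
library(xml2)

page <- read_html('<n1 class="a">1</n1>
<n1 class="b"><b>bold</b>2</n1>')
b_node <- xml_find_first(page, '//n1[@class="b"]')

# The selected node is still attached to the full document
root <- xml_root(b_node)
all_n1 <- xml_find_all(root, '//n1')
length(all_n1)       # 2

# Re-parsing the node's serialized form creates a new, separate document
isolated <- read_html(as.character(b_node))
isolated_n1 <- xml_find_all(isolated, '//n1')
length(isolated_n1)  # 1
```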

Carlos M Gomez · Answer 2 · 2017-02-10T20:51:12.093

0

I am not sure what you are trying to do, but why not traverse the nodes with a foreach? For example, in PHP with SimpleXML:

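<?php
// Parse the markup with SimpleXML (wrapped in a single root element)
$XML = simplexml_load_string('
<n1s>
<n1 class="a">1</n1>
<n1 class="b"><b>bold</b>2</n1></n1s>');

$valueA = '';
$valueB = '';
foreach ($XML->xpath('//n1') as $n1) {
    // Dispatch on the class attribute of the current node
    switch ((string)$n1['class']) {
        case 'a':
            $valueA = $n1;  // the loop variable, not $XML->n1 (always the first n1)
            break;
        case 'b':
            $valueB = $n1;
            break;
    }
}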

I hope this can help you. Regards!

edited Feb 10 '17 at 20:51

answered Feb 10 '17 at 19:53

Carlos M Gomez


See edit, this isn't a question of how to do something – MichaelChirico Feb 10 '17 at 19:55
