I have a (slightly) invalid HTML document like this:
<p>
1
</p>
<p>
<div>
2
</div>
</p>
<p>
3
</p>
W3C doesn't allow a `div` within a `p`.
So Chrome (and, I guess, a lot of other browsers) implicitly corrects the HTML code as follows (as described in this post: Putting <div> inside <p> is adding an extra <p>):
<p>
1
</p>
<p>
</p>
<div>
2
</div>
<p>
</p>
<p>
3
</p>
The browser corrected the HTML by adding a second `p`, and now there are 2 empty `p` elements and a `div` at root level. For the browser this is the real world, and it will always say that there are 4 `p` elements in total.
I'm building the XPath of the last `p` in JavaScript using `previousSibling`, as described here: How to use the Firebug xpath.js script? Walking `previousSibling` visits all 3 preceding `p` elements and returns `/p[4]` for the last `p`, which is perfectly right in terms of the W3C definition (and the browser).
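To make the index calculation concrete, here is a minimal sketch of the sibling-counting step. It is not the Firebug script itself: `makeSiblings` and `xpathStep` are hypothetical names, and plain objects stand in for DOM nodes so the snippet runs outside a browser. The two sibling lists are assumptions that mirror the two trees shown in this question.

```javascript
// Minimal stand-in for DOM nodes: only the fields the index calculation reads.
// makeSiblings is a hypothetical helper, not part of any library.
function makeSiblings(tagNames) {
  const nodes = tagNames.map((name) => ({
    nodeType: 1, // 1 is the value of Node.ELEMENT_NODE
    localName: name,
    previousSibling: null,
  }));
  for (let i = 1; i < nodes.length; i++) {
    nodes[i].previousSibling = nodes[i - 1];
  }
  return nodes;
}

// Builds one XPath step such as "p[4]" by counting the preceding
// element siblings that share the element's tag name.
function xpathStep(element) {
  let index = 1;
  for (let s = element.previousSibling; s; s = s.previousSibling) {
    if (s.nodeType === 1 && s.localName === element.localName) {
      index++;
    }
  }
  return element.localName + "[" + index + "]";
}

// Root-level children as the browser sees them after its correction.
const browserView = makeSiblings(["p", "p", "div", "p", "p"]);
// Root-level children as Nokogiri reports them.
const nokogiriView = makeSiblings(["p", "p", "div", "p"]);

console.log(xpathStep(browserView[browserView.length - 1]));   // p[4]
console.log(xpathStep(nokogiriView[nokogiriView.length - 1])); // p[3]
```

The same counting logic gives a different index for the same logical element, depending only on how many `p` siblings the parser created before it.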
Nokogiri, though, omits the second empty `p` and reports an error instead:
Unexpected end tag : p
For Nokogiri there are only 3 `p` elements in total, like this:
<p>
1
</p>
<p>
</p>
<div>
2
</div>
<p>
3
</p>
So when I get the "correct" XPath from my JavaScript code (i.e. `/p[4]`), I want to access the last `p` with Nokogiri by using `at_xpath("/p[4]")`. But I'm getting `nil` because Nokogiri only has 3 `p` elements.
How can I make Nokogiri handle the invalid HTML the same way the browser does (i.e. by adding a second, empty `p`), so that I get the last `p` when searching for `/p[4]`?
If the target `p` is always the last, you could try `at_xpath("/p[last()]")`
– Jack Fleeting Jul 26 '20 at 20:22