Nokogiri: x-path query fails when HTML code is invalid

Question

I have a (slightly) invalid HTML document like this:

<p>
    1
</p>
<p>
    <div>
        2
    </div>
</p>
<p>
    3
</p>

W3C doesn't allow a div within a p.

So Chrome (and, I guess, a lot of other browsers) implicitly corrects the HTML code as follows (as described in this post: Putting <div> inside <p> is adding an extra <p>):

<p>
    1
</p>
<p>
</p>
<div>
    2
</div>
<p>
</p>
<p>
    3
</p>

The browser corrected the HTML by adding a second p and now there are 2 empty p and a div at root level. For the browser this is the real world and it will always say that there are 4 p in total.

I'm building the xpath of the last p with Javascript using prevSibling as described here: How to use the Firebug xpath.js script?

prevSibling iterates over all 3 preceding p elements and returns /p[4] for the last p, which is perfectly right in terms of the W3C definition (and the browser).

Nokogiri, though, omits the second empty p and reports an error instead:

Unexpected end tag : p

For Nokogiri there are only 3 p in total like this:

<p>
    1
</p>
<p>
</p>
<div>
    2
</div>
<p>
    3
</p>

So when I get the "correct" xpath from my Javascript code (i.e. /p[4]) I want to access the last p with Nokogiri by using at_xpath("/p[4]"). But I'm getting nil because Nokogiri only has 3 p.

How can I make Nokogiri handle the invalid HTML the same way the browser does (i.e. adding a second, empty p) so I will get the last p when accessing it by searching for /p[4]?

If your target `<p>` is always the last, you could try `at_xpath("/p[last()]")` — Jack Fleeting, Jul 26 '20 at 20:22
Whenever HTML is invalid Nokogiri will attempt to fix it. Sometimes it can't and the resulting DOM won't reflect what you think it should. Your choice is to look at what Nokogiri gets after fixing the file and change your selector, or write code and use string manipulation to tweak the problem area into proper HTML, then pass it to Nokogiri. — the Tin Man, Jul 26 '20 at 23:10
Also, you can't rely on what a browser says; use `wget` or `curl` and view it in an editor or `nokogiri` at the command-line to retrieve the file and look at it in Nokogiri's IRB session. Trying to get Nokogiri to be broken like a browser isn't going to work well. I'd recommend asking on the Nokogiri IRC channel. — the Tin Man, Jul 26 '20 at 23:13
@JackFleeting: This is just an example. If there was a fourth valid `p` I still want the third `p` instead of the last one. — user2148956, Jul 29 '20 at 16:55
@theTinMan: Of course implicit error correction in different browsers can always lead to different results - one browser fixes the problem by adding an extra `p` while another one omits the closing `p`. Since I'm using Chrome by default, I know that I have to handle an extra `p` - which is absolutely correct/OK with respect to the W3C spec. So the browser is not "broken". It is a common way to handle invalid HTML. So I'm wondering why Nokogiri isn't flexible enough to act the same way as most of the common browsers do (I tested Chrome, Firefox and IE, which all add an extra `p`)? — user2148956, Jul 29 '20 at 16:59
So is it possible to make Nokogiri use the same HTML parser/engine as Chrome (i.e. Blink) or Firefox (i.e. Gecko)? — user2148956, Jul 29 '20 at 17:10

0 Answers