Parsing a wiki-styled web page, XPath error

Question

I am new to XPath and cannot manage to parse a simple wiki-styled web page with lxml.

I have the following expression:

"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//text()'))

It works fine, but I need to exclude children whose class is "reference". With the following expression I get an lxml.etree.XPathEvalError:

"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//*[not(@class="reference")].text()'))

What is the right XPath expression? Thanks in advance :)

howlger · Accepted Answer · 2016-07-04T07:17:44.763

1

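The error probably occurred because of .text() instead of /text(). In XPath, location steps are separated by a slash, and text() is a node test rather than a method called with a dot.

If you also want to include the text of the p elements themselves, you have to use the descendant-or-self XPath axis: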

//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(@class="reference")]/text()
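For illustration, here is a minimal sketch of how the expression above might be used from lxml. The markup in SAMPLE is a made-up stand-in for the real page (an assumption, not the asker's actual HTML), and the variable names are hypothetical.

```python
from lxml import html

# Hypothetical sample markup, loosely modelled on a MediaWiki page.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <div>
    <p>Python is a <b>language</b><sup class="reference">[1]</sup>.</p>
  </div>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Original expression from the question: every text node under the paragraph.
all_text = "".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//text()'))

# Expression from the answer: the paragraph itself plus its descendant
# elements, skipping any element whose class attribute is exactly
# "reference", then the direct text children of each remaining element.
filtered = "".join(tree.xpath(
    '//*[@id="mw-content-text"]/div[1]/p'
    '/descendant-or-self::*[not(@class="reference")]/text()'
))

print(all_text)
print(filtered)
```

With this sample, all_text still contains the "[1]" marker while filtered does not. Note that @class="reference" is an exact string comparison, so an element carrying several classes would not match it.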

edited Jul 04 '16 at 07:17

answered Jul 03 '16 at 21:07

howlger

Hi, could you please add some explanation to your code? This popped up in the review queue, as code-only answers tend to. – Will Jul 04 '16 at 04:34
1

Thanks, that's it! I understood it yesterday, and the final XPath expression is `//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(ancestor::sup)]/text()`. – Ilya Jul 04 '16 at 07:28
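A possible reason the class-based filter was not enough: if the reference marker's text sits in a nested element (for example a link inside the sup), that nested element has no class="reference" of its own and is still selected. The sketch below compares both predicates on made-up markup; SAMPLE and the variable names are assumptions for illustration only.

```python
from lxml import html

# Hypothetical markup: the marker text lives in an <a> nested inside the <sup>.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <div>
    <p>Python is a <b>language</b><sup class="reference"><a href="#note-1">[1]</a></sup>.</p>
  </div>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
base = '//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*'

# Class-based filter: drops the <sup> itself, but the nested <a> is still
# selected, so its text is collected.
by_class = "".join(tree.xpath(base + '[not(@class="reference")]/text()'))

# Ancestor-based filter from the comment: drops every element located
# inside a <sup>, so the nested <a> is skipped.
by_ancestor = "".join(tree.xpath(base + '[not(ancestor::sup)]/text()'))

print(by_class)
print(by_ancestor)
```

Here by_class still contains "[1]" and by_ancestor does not. The sup element itself is not its own ancestor, so text placed directly inside a sup would still be returned; not(ancestor-or-self::sup) is one variant that would exclude that as well.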
