
I am new to XPath, and I totally fail to parse a simple wiki-styled web page with lxml.

I have the following expression:

"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//text()'))

It works fine, but I need to exclude children whose class is "reference". The following expression raises lxml.etree.XPathEvalError:

"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//*[not(@class="reference")].text()'))

What is the right XPath expression? Thanks in advance :)

matchew
Ilya

1 Answer


The error probably occurred because of .text() instead of /text(): text() is a node test, so it has to be joined to the previous step with a slash, not a dot.

If you also want to include the text of the p elements themselves, you have to use the descendant-or-self XPath axis, because //* selects only the descendants of p:

//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(@class="reference")]/text()
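To make the difference concrete, here is a minimal sketch run against a made-up snippet. The markup in `SAMPLE` is an assumption for illustration, not the asker's real page, and it puts the reference marker directly inside the `sup`. It shows the `.text()` form failing, the `//*` form losing the text that belongs to `p` itself, and the `descendant-or-self` form keeping it.

```python
from lxml import etree, html

# Hypothetical markup standing in for the wiki page (not the real page).
SAMPLE = (
    '<div id="mw-content-text"><div>'
    '<p>Water is <b>wet</b><sup class="reference">[1]</sup>.</p>'
    '</div></div>'
)
tree = html.fromstring(SAMPLE)
BASE = '//*[@id="mw-content-text"]/div[1]/p'

# ".text()" is not valid XPath, so lxml refuses to evaluate the expression.
# XPathError is the base class of XPathEvalError.
try:
    tree.xpath(BASE + '//*[not(@class="reference")].text()')
    raised = False
except etree.XPathError:
    raised = True

# "//*" only reaches elements below p, so p's own text nodes are lost.
children_only = "".join(
    tree.xpath(BASE + '//*[not(@class="reference")]/text()')
)

# descendant-or-self::* also includes p itself.
filtered = "".join(
    tree.xpath(BASE + '/descendant-or-self::*[not(@class="reference")]/text()')
)

print(raised, repr(children_only), repr(filtered))
```

On this sample, `children_only` contains only the text of `b`, while `filtered` contains the whole sentence without the `[1]` marker.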
howlger
  • Hi, could you please add some explanation to your code? This popped up in the review queue, as code-only answers tend to. – Will Jul 04 '16 at 04:34
  • Thanks, that's it! I understood it yesterday, and the final XPath expression is `//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(ancestor::sup)]/text()`. – Ilya Jul 04 '16 at 07:28
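A likely reason the `ancestor::sup` form was needed is that the marker text often sits in a link nested inside the `sup`, and that link does not carry the "reference" class itself. The sketch below assumes such markup; `SAMPLE` is hypothetical and only mimics that structure.

```python
from lxml import html

# Hypothetical markup: the "[1]" text lives in an <a> inside the <sup>.
SAMPLE = (
    '<div id="mw-content-text"><div>'
    '<p>Water is <b>wet</b>'
    '<sup class="reference"><a href="#cite-1">[1]</a></sup>.</p>'
    '</div></div>'
)
tree = html.fromstring(SAMPLE)
BASE = '//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*'

# Filtering on the class drops the <sup>, but the nested <a> still matches.
by_class = "".join(tree.xpath(BASE + '[not(@class="reference")]/text()'))

# Filtering on the ancestor drops everything nested inside any <sup>.
by_ancestor = "".join(tree.xpath(BASE + '[not(ancestor::sup)]/text()'))

print(repr(by_class))
print(repr(by_ancestor))
```

Note that `not(ancestor::sup)` still matches the `sup` element itself, so text placed directly inside a `sup` would be kept; `not(ancestor-or-self::sup)` would exclude that as well. Either form also drops superscripts that are not references.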