I have HTML code like this:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width">
        <title>test</title>
    </head>
    <body>
        <h3><a href="#" name='title1'>Title</a></h3>
        <div>para1</div>
        <div>para2</div>
        <div>para3</div>
        <h3><a href="#" name='title2'>Title</a></h3>
        <div>para4</div>
        <div>para5</div>
    </body>
</html>

What I want is:

<div>para1</div>
<div>para2</div>
<div>para3</div>

That is, I want only the first group of <div> elements (the ones that follow the first <h3>), and I need to ignore the second group.

For now, I have worked it out this way:

#!/usr/bin/env python
# encoding: utf-8

import unittest

from lxml import etree

class SearchPara(unittest.TestCase):

    def setUp(self):
        with open('test.html') as f:
            self.html = f.read()

    def test_parse_html(self):
        paras = ''
        page = etree.HTML(self.html)
        a_ele = page.xpath("//h3/a[@name='title1']/..")

        if a_ele is None or len(a_ele) < 1:
            return paras

        para = a_ele[0].xpath('following-sibling::*[1][name(.) != "h3"]')
        while para is not None and len(para) > 0:
            print para
            paras += etree.tostring(para[0])
            para = para[0].xpath('following-sibling::*[1][name(.) != "h3"]')

        print paras


    def tearDown(self):
        pass

if __name__ == "__main__":
    unittest.main()

As you can see, this is a little bit complicated. Is there a better way to do this?

roger
  • @mins you puzzled me, `p[a]` means `All P with an A child`, but the `
    ` are not children of ``
    – roger Apr 11 '15 at 10:01
  • @mins I want to get the first `para1 para2 para3` and the second `para4 para5` as my result
    – roger Apr 11 '15 at 10:03
  • Like you edited it, now it seems you want to select all
    regardless of other criteria. Maybe you should clarify what you want with a better example.
    – mins Apr 11 '15 at 10:10
  • @mins yes, exactly. In fact, I am doing this because I want to get all paragraphs associated with a title. Can I do this? – roger Apr 11 '15 at 10:16
  • You want to get all siblings from a starting point, until some sibling is found: See [XPath : select all following siblings until another sibling](http://stackoverflow.com/questions/2161766/xpath-select-all-following-siblings-until-another-sibling) – mins Apr 11 '15 at 10:21
  • roger, your question is not clear. It is unclear what the rules are for selecting nodes in your example. Please edit your question and show _another_ sample. A [**good sample**](http://stackoverflow.com/help/mcve) contains all the complexity present in your actual data. Show us a sample where some of the `div` elements should _not_ be selected. Also, samples must be **complete** documents. – Mathias Müller Apr 11 '15 at 22:38
  • Please show us what you tried so far. Furthermore, your assertion does not apply. The ` – Markus W Mahlberg Apr 12 '15 at 14:34
  • @MarkusWMahlberg as you insist, I have given you all the code; you can test it on your machine. – roger Apr 13 '15 at 04:29
  • @MathiasMüller I have changed my question completely; maybe it is clear now. – roger Apr 13 '15 at 04:30
  • @roger: I do not insist on my purpose. You might want to read [How do I ask a good question](http://stackoverflow.com/help/how-to-ask), which enhances the probability for getting a useful answer _drastically_. You might find [ESR](https://en.m.wikipedia.org/wiki/Eric_S._Raymond)'s excellent essay [How To Ask Questions The Smart Way](http://catb.org/~esr/faqs/smart-questions.html) helpful, too. – Markus W Mahlberg Apr 13 '15 at 10:18

1 Answer

As far as I know, there is no general way to select elements between two elements using XPath 1.0.

The same output can still be achieved if we define the condition differently, for example by selecting the <div>s whose nearest preceding sibling <a> has the value "Title: Part I":

//div[preceding-sibling::a[1][. = 'Title: Part I']]

Selecting the next group of <div>s only requires changing the <a> criterion:

//div[preceding-sibling::a[1][. = 'Title: Part II']]

A test method to see the above XPath in action:

def test_parse_html(self):
    page = etree.HTML(self.html)
    paras = ''
    para = page.xpath("//div[preceding-sibling::a[1][. = 'Title: Part I']]")
    for p in para:
        paras += etree.tostring(p)

    print paras
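
For the HTML currently shown in the question, where each title is an <h3> wrapping a named <a>, the same idea can be sketched as below. This is only a minimal Python 3 sketch under assumptions: the markup is inlined and trimmed instead of read from test.html, the XPath is the variant suggested in the comments under this answer, and `paragraphs_for` is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# Trimmed copy of the question's HTML, inlined so the sketch is self-contained.
HTML = """<html>
    <head>
        <title>test</title>
    </head>
    <body>
        <h3><a href="#" name='title1'>Title</a></h3>
        <div>para1</div>
        <div>para2</div>
        <div>para3</div>
        <h3><a href="#" name='title2'>Title</a></h3>
        <div>para4</div>
        <div>para5</div>
    </body>
</html>"""


def paragraphs_for(page, anchor_name):
    """Hypothetical helper: text of every <div> whose nearest preceding
    <h3> sibling contains an <a> with the given name attribute."""
    divs = page.xpath(
        "//div[preceding-sibling::h3[1][a/@name = $name]]",
        name=anchor_name,  # passed to lxml as the XPath variable $name
    )
    return [div.text for div in divs]


page = etree.HTML(HTML)
first_group = paragraphs_for(page, "title1")
second_group = paragraphs_for(page, "title2")
print(first_group)
print(second_group)
```

Passing the name as an XPath variable keeps the expression itself fixed, so switching to the second group only means calling the helper with "title2".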

Side note: the XPath for populating a_ele in your code can be simplified this way:

a_ele = page.xpath("//a[h3 = 'Title: Part I']")

or even further, since the only text node within the <a> is "Title: Part I":

a_ele = page.xpath("//a[. = 'Title: Part I']")
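
The two expressions above appear to target an earlier version of the markup. For the HTML currently in the question, a comparable simplification would be to replace the parent step (`/..`) with a predicate on the <h3>. The sketch below, written for Python 3 with a shortened copy of the markup as an assumption, checks that both forms select the same element:

```python
from lxml import etree

# Shortened copy of the question's markup (an assumption for illustration).
HTML = """<html><body>
<h3><a href="#" name="title1">Title</a></h3>
<div>para1</div>
<h3><a href="#" name="title2">Title</a></h3>
<div>para4</div>
</body></html>"""

page = etree.HTML(HTML)
tree = page.getroottree()

# Form used in the question: find the <a>, then step up to its parent <h3>.
via_parent_step = page.xpath("//h3/a[@name='title1']/..")
# Predicate form: select the <h3> directly by a condition on its child <a>.
via_predicate = page.xpath("//h3[a/@name='title1']")

# Compare the absolute paths of the selected elements.
same_element = (
    tree.getpath(via_parent_step[0]) == tree.getpath(via_predicate[0])
)
print(same_element)
```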
har07
  • I am so so so so sorry, I need to bother you again, my situation is a little bit different now, I tried to use your tips to simplify my program, can you give me a tip again? – roger Apr 13 '15 at 06:40
  • Same approach with slight change to the xpath expression : `//div[preceding-sibling::h3[1][a/@name = 'title1']]` ? – har07 Apr 13 '15 at 06:58