For the HTML below, I need to extract each date together with the times, hrefs, formats, and other fields (not shown) that belong to it.

<div class="showtimes">
    <h2>The Little Prince</h2>

    <div class="poster" data-poster-url="http://www.test.com">
        <img src="http://www.test.com">
    </div>

    <div class="showstimes">

        <div class="date">9 December, Wednesday</div>
        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">12:30</a>
            <span class="show-format">3D</span>
        </span>

        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">15:30</a>
            <span class="show-format">3D</span>
        </span>

        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">18:30</a>
            <span class="show-format">3D</span>
        </span>


        <div class="date">10 December, Thursday</div>
        <span class="show-time techno-2d">
            <a href="http://www.test.com" class="link">12:30</a>
            <span class="show-format">2D</span>         
        </span>

        <span class="show-time techno-3d">
            <a href="http://www.test.com" class="link">15:30</a>
            <span class="show-format">3D</span>
        </span>
    </div>
</div>

To do this, I use the following Python code:

for dates in movie.xpath('.//div[@class="showstimes"]/div[@class="date"]'):
    date = dates.xpath('.//text()')[0]

    # for times in dates.xpath('//following-sibling::span[1 = count(preceding-sibling::div[1] | (.//div[@class="date"])[1])]'):
    # for times in dates.xpath('//following-sibling::span[contains(@class,"show-time")]'):
    # for times in dates.xpath('.//../span[contains(@class,"show-time")]'):
    # for times in dates.xpath('//following-sibling::span[preceding-sibling::div[1][.="date"]]'):
        time = times.xpath('.//a/text()')[0]
        url = times.xpath('.//a/@href')[0]
        format_type = times.xpath('.//span[@class="show-format"]/text()')[0]

Getting the dates is not a problem; the problem is getting the rest of the information for each particular date. I have tried many different approaches without success (some of them are left as comments in the code above). I can't work out how to handle the case where the nodes I need are siblings of the date node (all on the same level) rather than its children. In this case:

-> div Date1
-> span Time1
-> span href1
-> span Format1

-> span Time2
-> span href2
-> span Format2

-> span Time3
-> span href3
-> span Format3

-> div Date2
-> span Time1
-> span href1
-> span Format1
# etc etc

1 Answer

It turns out that lxml supports referencing a Python variable from an XPath expression, which proves useful in this case. For every date div, you can select the following-sibling spans whose nearest preceding-sibling div is the current date div, where the reference to the current date div is passed in through the Python variable dates:

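for dates in movie.xpath('.//div[@class="showstimes"]/div[@class="date"]'):
    # normalize-space() returns the date text with surrounding whitespace trimmed
    date = dates.xpath('normalize-space()')
    # Select the sibling spans whose nearest preceding div matches the current date div.
    # Note: "=" compares string values, so this relies on each date text being unique.
    for times in dates.xpath('following-sibling::span[preceding-sibling::div[1]=$current]', current=dates):
        time = times.xpath('a/text()')[0]
        url = times.xpath('a/@href')[0]
        format_type = times.xpath('span/text()')[0]
        # print() as a function works on Python 3; repr() keeps the quotes shown in the output
        print(repr(date), repr(time), repr(url), repr(format_type), sep=', ')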

Output:

'9 December, Wednesday', '12:30', 'http://www.test.com', '3D'
'9 December, Wednesday', '15:30', 'http://www.test.com', '3D'
'9 December, Wednesday', '18:30', 'http://www.test.com', '3D'
'10 December, Thursday', '12:30', 'http://www.test.com', '2D'
'10 December, Thursday', '15:30', 'http://www.test.com', '3D'
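
Because the `=` comparison above works on string values, two date divs with identical text would be treated as the same date. A possible variation, sketched here under the assumption that every date div carries class="date", passes the position of the current date as a numeric XPath variable and counts the preceding date divs instead. The SNIPPET fragment, the function name showtimes_by_date, and the variable name $position are made up for illustration and are not part of the original code.

```python
from lxml import html

# Hypothetical, trimmed-down version of the markup from the question.
SNIPPET = """
<div class="showstimes">
  <div class="date">9 December, Wednesday</div>
  <span class="show-time techno-3d">
    <a href="http://www.test.com" class="link">12:30</a>
    <span class="show-format">3D</span>
  </span>
  <span class="show-time techno-3d">
    <a href="http://www.test.com" class="link">15:30</a>
    <span class="show-format">3D</span>
  </span>
  <div class="date">10 December, Thursday</div>
  <span class="show-time techno-2d">
    <a href="http://www.test.com" class="link">12:30</a>
    <span class="show-format">2D</span>
  </span>
</div>
"""


def showtimes_by_date(container):
    """Return (date, time, url, format) tuples, grouped by date position."""
    rows = []
    date_divs = container.xpath('./div[@class="date"]')
    for position, date_div in enumerate(date_divs, start=1):
        date = date_div.xpath('normalize-space()')
        # Keep only the spans that have exactly `position` date divs before
        # them, i.e. the spans between this date div and the next one.
        spans = date_div.xpath(
            'following-sibling::span[contains(@class, "show-time")]'
            '[count(preceding-sibling::div[@class="date"]) = $position]',
            position=position,
        )
        for span in spans:
            rows.append((
                date,
                span.xpath('normalize-space(a)'),
                span.xpath('string(a/@href)'),
                span.xpath('normalize-space(span[@class="show-format"])'),
            ))
    return rows


# html.fromstring returns the single top-level div of the fragment.
ROWS = showtimes_by_date(html.fromstring(SNIPPET))
for row in ROWS:
    print(row)
```

The count-based predicate depends only on document structure, so it should keep working even if two dates happen to share the same text.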

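Since the dates and showtimes are flat siblings, another option is to skip the sibling-axis XPath entirely and walk the children of the container once in document order, remembering the most recent date. The sketch below uses the standard library's xml.etree.ElementTree, which assumes the fragment is well-formed XML (the unclosed img tag from the question is left out); lxml elements can be iterated over in the same way. SNIPPET and group_showtimes are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the markup from the question.
SNIPPET = """
<div class="showstimes">
  <div class="date">9 December, Wednesday</div>
  <span class="show-time techno-3d">
    <a href="http://www.test.com" class="link">12:30</a>
    <span class="show-format">3D</span>
  </span>
  <div class="date">10 December, Thursday</div>
  <span class="show-time techno-2d">
    <a href="http://www.test.com" class="link">12:30</a>
    <span class="show-format">2D</span>
  </span>
  <span class="show-time techno-3d">
    <a href="http://www.test.com" class="link">15:30</a>
    <span class="show-format">3D</span>
  </span>
</div>
"""


def group_showtimes(container):
    """Walk the direct children once, attaching each showtime to the last date seen."""
    rows = []
    current_date = None
    for child in container:
        classes = (child.get("class") or "").split()
        if child.tag == "div" and "date" in classes:
            # A new date starts; collapse any stray whitespace in its text.
            current_date = " ".join((child.text or "").split())
        elif child.tag == "span" and "show-time" in classes and current_date is not None:
            link = child.find("a")
            fmt = child.find("span[@class='show-format']")
            rows.append((
                current_date,
                (link.text or "").strip(),
                link.get("href"),
                (fmt.text or "").strip(),
            ))
    return rows


ROWS = group_showtimes(ET.fromstring(SNIPPET))
for row in ROWS:
    print(row)
```

This does a single pass over the siblings, whereas the XPath versions re-scan the following siblings for every date; for a handful of showtimes the difference is unlikely to matter.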