I have scraped a page with this HTML content:

<div class="td-ss-main-content">
  <div class="td-page-header">...</div>
  <div class="td_module_16 td_module_wrap td-animation-stack">...</div>
  <div class="td_module_16 td_module_wrap td-animation-stack td_module_no_thumb">...</div>
  <div class="page-nav td-pb-padding-side">
    <span class="current">1</span>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/" class="page" title="2">2</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/3/" class="page" title="3">3</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/"><i class="td-icon-menu-right"></i></a>
    <span class="pages">Page 1 of 3</span>
  </div>
</div>

Now I would like to get the next-page link, if it is present. It is the href value of the `.page-nav > a` element that contains an `i` tag.

I can do this:

response.css("div.page-nav > a")[2].css("::attr(href)").extract_first()

But this won't work if I am on page 2, because the position of that link changes. So it would be better to get the href of the `a` tag that has an `i` tag as a child element. How can I achieve that?

update (page 2)

<div class="page-nav td-pb-padding-side">
    <a href="http://www.arunachaltimes.in/2017/05/06/"><i class="td-icon-menu-left"></i></a>
    <a href="http://www.arunachaltimes.in/2017/05/06/" class="page" title="1">1</a>
    <span class="current">2</span>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/3/" class="page" title="3">3</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/3/"><i class="td-icon-menu-right"></i></a>
    <span class="pages">Page 2 of 3</span>
</div>

update (page 3 last page)

<div class="page-nav td-pb-padding-side">
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/"><i class="td-icon-menu-left"></i></a>
    <a href="http://www.arunachaltimes.in/2017/05/06/" class="page" title="1">1</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/" class="page" title="2">2</a>
    <span class="current">3</span>
    <span class="pages">Page 3 of 3</span>
</div>
Robin

1 Answer

You can achieve it with an XPath expression:

//div[contains(concat(' ', @class, ' '), ' page-nav ')]/a[contains(concat(' ', i/@class, ' '), ' td-icon-menu-right ')]/@href

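Note that the class checks pad the attribute value with spaces via concat, so only a whole class name matches. A longer class name that merely contains the same substring cannot produce a false positive.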

Demo:

$ scrapy shell file:////$PWD/index.html
In [1]: response.xpath("//div[contains(concat(' ', @class, ' '), ' page-nav ')]/a[contains(concat(' ', i/@class, ' '), ' td-icon-menu-right ')]/@href").extract_first()
Out[1]: u'http://www.arunachaltimes.in/2017/05/06/page/2/'
alecxe
  • I am sorry, but the XPath expression is not working. If I am on the second page, it's showing the 1st page. And if I am on the 3rd (last) page, it's showing the 2nd page. – Robin May 06 '17 at 17:21
  • @Robin could it be that your requirement to have an `i` element inside the `a` is not valid? I just followed the instructions. Could you post how the HTML looks if you are on the 2nd page? – alecxe May 06 '17 at 17:23
  • Even the CSS version is not working. If I am on the 2nd page, it gets the correct URL. But if I am on the 3rd (last) page, it returns the 2nd page. – Robin May 06 '17 at 17:25
  • @Robin okay, it's not just `i` then - it's `i` with the `td-icon-menu-right` class. Please check the updated answer. – alecxe May 06 '17 at 17:39
  • Yes, I was being careless. Thank you. – Robin May 06 '17 at 17:48