
Say I have a structure like this:

<div id="body">
<h1> Title </h1>
<p> Date Created </p>
<p class="text-bold"> Description </p>
<p> Para1 </p>
<p> Para2 </p>
<p> Para..</p>
<p> ParaN </p>
</div>

I am trying to get Para1 through ParaN appended together. To complicate things, Para1 is not always at the same position: on some pages it is //p[5], on others //p[6].

So if I run something like this,

def parse_details(self, response):
    item = response.meta["item"]
    # x varies between pages, so a fixed index is unreliable
    item['Message'] = response.xpath('//p[x]/text()').extract()
    yield item

it will sometimes fail or return the wrong field, because x is somewhat dynamic. What stays constant is that the paragraphs I need always come after <p class="text-bold"> Description </p>.

Is there any way to do this?

BernardL

1 Answer


If you need all elements after <p class="text-bold"> Description </p>, you can simply use XPath's following-sibling axis:

html = """
<div id="body">
<h1> Title </h1>
<p> Date Created </p>
<p class="text-bold"> Description </p>
<p> Para1 </p>
<p> Para2 </p>
<p> Para..</p>
<p> ParaN </p>
</div>
"""

from scrapy import Selector

sel = Selector(text=html)
# select every <p> that follows the "Description" paragraph
xpath = "//p[contains(text(), 'Description')]/following-sibling::p/text()"
r = sel.xpath(xpath).extract()
print(r)
# [u' Para1 ', u' Para2 ', u' Para..', u' ParaN ']
Pawel Miech
  • Thanks! Worked well. I find the Scrapy documentation a bit difficult to understand, but hopefully I can absorb more. Any idea where I can learn more about the syntax and see examples? – BernardL May 17 '16 at 07:56
    This is not really Scrapy, this is XPath. I agree the XPath docs are rough. I found the Microsoft XPath docs useful: https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx and MDN has a page too: https://developer.mozilla.org/en-US/docs/Web/XPath Remember you can always use CSS selectors (they are better documented, but XPath is sometimes more powerful); response.css() is always an option in Scrapy – Pawel Miech May 17 '16 at 08:07
  • Thanks for the guidance. Will definitely explore further. – BernardL May 17 '16 at 08:18