
I tried using igaggini's example from this question but can't seem to get it to work with my code: Scrapy: Follow link to get additional Item data?

I'm pretty sure I have the right XPaths; the output should be the second paragraph in the first div of each page linked from the countries page.

Here is my main file, recursive.py:

from scrapy.spider import BaseSpider
from bathUni.items import BathuniItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin

class recursiveSpider(BaseSpider):
    name = 'recursive'
    allowed_domains = ['http://www.bristol.ac.uk/']
    start_urls = ['http://www.bristol.ac.uk/international/countries/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = []

        # scrape main page to get row links
        for i in range(1, 154):
            xpath = ('//*[@id="all-countries"]/li[*]/ul/li[*]/a'.format(i+1))
            link = hxs.select(xpath).extract()
            links.append(link)

        # parse links to get content of the linked pages
        for link in links:
            item = BathuniItem()
            item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]')

            yield item

And here is my items file:

from scrapy.item import Item, Field

class BathuniItem(Item):
    Country = Field()
    Qualification = Field()

The output I receive is not what I want; my CSV file is full of entries like this:

<HtmlXPathSelector xpath='//*[@id="all-countries"]/li[*]/ul/li[*]/a' data=u'<a href="/international/countries/albani'>
  • Just a hint: all those `[*]` predicates are unnecessary; e.g. you match all list items that have any child elements, and then select those that are unordered lists (which only returns a result if there are children anyway). – Jens Erat Feb 28 '14 at 23:21
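
Following that hint, the same links can be selected in one go without the `[*]` predicates; a minimal sketch of the simplified expression, assuming the nested-list markup of the countries page:

links = hxs.select('//*[@id="all-countries"]/li/ul/li/a/@href').extract()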

1 Answer


You should call .extract() on the selector to get a useful value instead of a SelectorList:

item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]').extract()
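
For context, select() returns a list of selector objects, and the repr of those selectors is exactly what was filling the CSV; a minimal illustration, assuming the same page:

sel = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]')
# sel is a SelectorList; its repr is what appeared in the CSV output above
texts = sel.extract()
# texts is a plain list of unicode strings, e.g. [u'<p>...</p>']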

Another thing: I understand you want to fetch the pages corresponding to the links:

#parse links to get content of the linked pages
for link in links:
    item = BathuniItem()
    item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]').extract()

    yield item

That code will not fetch those linked pages; you need to yield additional Requests to tell Scrapy to download them.

You should do something like:

    start_urls = ['http://www.bristol.ac.uk/international/countries/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # scrape the main page once to get the row links;
        # the original loop ran 153 times, but the .format() call had no
        # placeholder to fill, so every iteration selected the same nodes
        links = hxs.select('//*[@id="all-countries"]/li[*]/ul/li[*]/a/@href').extract()

        # yield a Request per link so Scrapy actually downloads the pages;
        # the hrefs are relative, so join them against the response URL
        for link in links:
            yield Request(urljoin(response.url, link), callback=self.parse_linkpage)

    def parse_linkpage(self, response):
        hxs = HtmlXPathSelector(response)
        item = BathuniItem()
        item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]').extract()
        return item
– paul trmbrth