
I tried using igaggini's example from this question but can't seem to get it to work with my code: Scrapy: Follow link to get additional Item data?

I'm pretty sure I have the right XPaths; the output should be the second paragraph in the first div of each page linked from the countries page.

Here is my main file, recursive.py:

from scrapy.spider import BaseSpider
from bathUni.items import BathuniItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin

class recursiveSpider(BaseSpider):
    name = 'recursive'
    allowed_domains = ['http://www.bristol.ac.uk/']
    start_urls = ['http://www.bristol.ac.uk/international/countries/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = []

        # scrape main page to get row links
        for i in range(1, 154):
            xpath = ('//*[@id="all-countries"]/li[*]/ul/li[*]/a'.format(i+1))
            link = hxs.select(xpath).extract()
            links.append(link)

        # parse links to get content of the linked pages
        for link in links:
            item = BathuniItem()
            item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]')

            yield item

And here is my items file:

from scrapy.item import Item, Field

class BathuniItem(Item):
    Country = Field()
    Qualification = Field()

The output I receive is not what I want; my CSV file is full of entries like this:

<HtmlXPathSelector xpath='//*[@id="all-countries"]/li[*]/ul/li[*]/a' data=u'<a href="/international/countries/albani'>
  • Just a hint: all those `[*]` predicates are unnecessary; e.g. you match all list items that have any child elements, and then select those that are unordered lists (which only returns a result if there are children anyway). – Jens Erat Feb 28 '14 at 23:21
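
Following that hint, the same links can be selected in one go without the `[*]` predicates; a minimal sketch of the simplified expression, assuming the nested-list markup of the countries page:

links = hxs.select('//*[@id="all-countries"]/li/ul/li/a/@href').extract()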

1 Answer


You should call .extract() on the selector to get a useful value instead of a SelectorList:

item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]').extract()
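
For context, select() returns a list of selector objects, and the repr of those selectors is exactly what was filling the CSV; a minimal illustration, assuming the same page:

sel = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]')
# sel is a SelectorList; its repr is what appeared in the CSV output above
texts = sel.extract()
# texts is a plain list of unicode strings, e.g. [u'<p>...</p>']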

Another thing: I understand you want to fetch the pages corresponding to the links:

#parse links to get content of the linked pages
for link in links:
    item = BathuniItem()
    item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]').extract()

    yield item

That code will not fetch those linked pages; you need to yield additional Requests to tell Scrapy to download them.

You should do something like:

    start_urls = ['http://www.bristol.ac.uk/international/countries/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # scrape the main page once to get the row links;
        # the original loop ran 153 times, but the .format() call had no
        # placeholder to fill, so every iteration selected the same nodes
        links = hxs.select('//*[@id="all-countries"]/li[*]/ul/li[*]/a/@href').extract()

        # yield a Request per link so Scrapy actually downloads the pages;
        # the hrefs are relative, so join them against the response URL
        for link in links:
            yield Request(urljoin(response.url, link), callback=self.parse_linkpage)

    def parse_linkpage(self, response):
        hxs = HtmlXPathSelector(response)
        item = BathuniItem()
        item['Qualification'] = hxs.select('//*[@id="uobcms-content"]/div/div/div[1]/p[2]').extract()
        return item
– paul trmbrth