
I want to scrape data from multiple pages with Scrapy. To get the data on the second page, I have to rely on cookies to pass the search term along (because the search term doesn't appear in the URL).
The first page's URL is:

http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky

The second page's URL is:

http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10

I have seen many questions on Stack Overflow, but in all of them people already know the cookies before they start crawling. In my case I can only get the cookies after crawling the first page, so I want to know how to handle this. Here is my code:

__author__ = 'Rabbit'
from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    url_base = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"

    start_urls = [url_base]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item
Coding_Rabbit

2 Answers


Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, just like any regular web browser does; see the CookiesMiddleware documentation for more information.
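If you want to watch the cookies flowing, Scrapy's COOKIES_DEBUG setting makes the cookies middleware log every Cookie and Set-Cookie header it handles. A minimal sketch of enabling it per spider (the spider name here is hypothetical, just for illustration):

from scrapy.spiders import Spider


class CookieDebugSpider(Spider):
    name = "cookie_debug"  # hypothetical name, for illustration only
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"]

    # COOKIES_DEBUG=True makes CookiesMiddleware log each cookie it
    # sends and receives, so you can watch the session cookie from
    # the first page being replayed on later requests
    custom_settings = {'COOKIES_DEBUG': True}

    def parse(self, response):
        pass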

I don't see where you are paginating in your code, but it should be something like this:

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        # the cookies from this response are sent automatically
        # along with the next request
        yield Request('http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10',
                      callback=self.parse_second_url)

    def parse_second_url(self, response):
        # do your thing
        pass

The second request carries the cookies from the first request.
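If you want to verify this, the Cookie header that the cookies middleware attached is visible on response.request inside the callback. A minimal sketch of such a check (the log message is just for illustration):

    def parse_second_url(self, response):
        # response.request is the Request that produced this response;
        # CookiesMiddleware added the Cookie header to it before it was
        # sent, so we can log what was actually transmitted
        cookie_header = response.request.headers.get('Cookie')
        self.logger.info('Cookie sent with page 2: %s', cookie_header)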

eLRuLL

I just saw that you posted the same problem here that you had already posted earlier, and which I answered yesterday. So I am posting my answer here again and leaving it up to the moderators...

Once link parsing and Request yielding are added to your parse() function, your example just works for me. Maybe the page generates some server-side cookies. It does fail, though, when using a proxy service like Scrapy's Crawlera, which downloads from multiple IPs.

The solution is to pass the 'textquery' parameter explicitly in the request URL:

import urlparse
from urllib import urlencode

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = 'calb'
    base_url = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=0&textquery=%s"
    start_urls = [base_url % term]

    def update_url(self, url, params):
        # parse the URL, merge the extra params into its query
        # string, and rebuild it
        url_parts = list(urlparse.urlparse(url))
        query = dict(urlparse.parse_qsl(url_parts[4]))
        query.update(params)
        url_parts[4] = urlencode(query)
        return urlparse.urlunparse(url_parts)

    def parse(self, response):
        sel = Selector(response)
        genes = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for gene in genes:
            item = {}
            item['genID'] = map(unicode.strip, gene.xpath('td[1]/a/text()').extract())
            # ...
            yield item

        urls = sel.xpath('//div[@id="nviRecords"]/span[@id="quickPage"]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            url = self.update_url(url, params={'textquery': self.term})
            yield Request(url)

The update_url() helper comes from Lukasz's solution in:
Add params to given URL in Python
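Side note: the snippet above is Python 2 (urlparse module, urllib.urlencode). On Python 3 the same functions live in urllib.parse, so an equivalent port of the helper would be:

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def update_url(url, params):
    # same logic as above: merge params into the URL's query string
    # and rebuild the URL
    url_parts = list(urlparse(url))
    query = dict(parse_qsl(url_parts[4]))
    query.update(params)
    url_parts[4] = urlencode(query)
    return urlunparse(url_parts)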

dron22