
I'm using Scrapy to scrape data from this page:

https://www.bricoetloisirs.ch/magasins/gardena

The product list is loaded dynamically. I found the URL that returns the products:

https://www.bricoetloisirs.ch/coop/ajax/nextPage/(cpgnum=1&layout=7.01-14_180_69_164_182&uiarea=2&carea=%24ROOT&fwrd=frwd0&cpgsize=12)/.do?page=2&_=1473841539272

But when I scrape it with Scrapy, it returns an empty page:

<span class="pageSizeInformation" id="page0" data-page="0" data-pagesize="12">Page: 0 / Size: 12</span>
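For debugging, that "empty page" condition can be recognized programmatically. This is a small sketch (the helper name and regex are my own, not from the original code) that pulls the page number and page size out of the marker span:

```python
import re

# Hypothetical helper: matches the "pageSizeInformation" marker span that the
# server returns when the AJAX request yields no products.
EMPTY_PAGE_RE = re.compile(
    r'<span class="pageSizeInformation"[^>]*data-page="(\d+)"[^>]*data-pagesize="(\d+)"'
)

def parse_page_marker(body):
    """Return (page, size) from the marker span, or None if it is absent."""
    m = EMPTY_PAGE_RE.search(body)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

body = ('<span class="pageSizeInformation" id="page0" data-page="0" '
        'data-pagesize="12">Page: 0 / Size: 12</span>')
print(parse_page_marker(body))  # (0, 12)
```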

Here is my code:

# -*- coding: utf-8 -*-
import scrapy

from v4.items import Product


class GardenaCoopBricoLoisirsSpider(scrapy.Spider):
    name = "Gardena_Coop_Brico_Loisirs_py"

    start_urls = [
            'https://www.bricoetloisirs.ch/coop/ajax/nextPage/(cpgnum=1&layout=7.01-14_180_69_164_182&uiarea=2&carea=%24ROOT&fwrd=frwd0&cpgsize=12)/.do?page=2&_=1473841539272'
        ]

    def parse(self, response):
        print(response.body)
  • Because that's what you get when you hit the URL stated in your start_urls – dnit13 Sep 14 '16 at 08:58
  • Your issue seems to lie in cookies. Have you tried having just `https://www.bricoetloisirs.ch/magasins/gardena` in start_urls and then yielding the AJAX request? Scrapy manages cookies automatically, so all you need to do is replicate the request chain and some of the headers, and you should receive the same response. – Granitosaurus Sep 14 '16 at 09:03
    @Granitosaurus was right. – Andrew Gowa Sep 14 '16 at 12:33

3 Answers


I solved this. The trick is to request the category page first so the server sets a session cookie; Scrapy then sends that cookie automatically with the AJAX pagination requests.

# -*- coding: utf-8 -*-
import scrapy

from v4.items import Product


class GardenaCoopBricoLoisirsSpider(scrapy.Spider):
    name = "Gardena_Coop_Brico_Loisirs_py"

    start_urls = [
            'https://www.bricoetloisirs.ch/magasins/gardena'
        ]

    def parse(self, response):
        for page in range(1, 50):
            url = response.url + '/.do?page=%s&_=1473841539272' % page
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        print(response.body)
  • Please edit your answer and elaborate what was the problem and how you solved it. – kchomski Mar 19 '17 at 23:01
    @kchomski If you read through the description, he wanted to load */.do?page=%s&_=1473841539272* and was unable to because he needed a cookie. He loaded *https://www.bricoetloisirs.ch/magasins/gardena* first to get a cookie from the server, and then went on and got what he wanted. – AturSams Mar 20 '19 at 05:42
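The URL construction from the accepted answer's loop can be sketched as a pure function (the page range and timestamp token are taken from the answer's code; the function name is my own):

```python
def pagination_urls(base_url, pages=50, token='1473841539272'):
    """Build the AJAX pagination URLs the accepted answer's spider yields.

    base_url is the category page (fetched first so the server sets the
    session cookie); each paginated URL appends the .do endpoint to it.
    """
    return [
        '%s/.do?page=%d&_=%s' % (base_url, page, token)
        for page in range(1, pages)
    ]

urls = pagination_urls('https://www.bricoetloisirs.ch/magasins/gardena', pages=3)
print(urls[0])
# https://www.bricoetloisirs.ch/magasins/gardena/.do?page=1&_=1473841539272
```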

As far as I know, websites use JavaScript to make AJAX calls, and when you use Scrapy the page's JS does not run.

You will need to take a look at Selenium for scraping those kinds of pages.

Or find out what AJAX calls are being made and send them yourself.
Check this: Can scrapy be used to scrape dynamic content from websites that are using AJAX? It may help you as well.

    That's not true. Ajax is just an asynchronous request that can be easily replicated with scrapy or anything else for that matter. It's true, however, that you can use something like selenium to render the page with all of the ajax requests and bells and whistles if you are looking for a lazy, do-it-all approach. – Granitosaurus Sep 14 '16 at 09:01
  • @Granitosaurus Yes, you are right; that's why I linked another answer where they talk about analyzing the network calls and simulating the AJAX requests – Urban48 Sep 14 '16 at 09:34

I believe you need to send an additional request just like a browser does. Try to modify your code as follows:

# -*- coding: utf-8 -*-
import scrapy

from scrapy.http import Request
from v4.items import Product


class GardenaCoopBricoLoisirsSpider(scrapy.Spider):
    name = "Gardena_Coop_Brico_Loisirs_py"

    start_urls = [
        'https://www.bricoetloisirs.ch/coop/ajax/nextPage/'
    ]

    def parse(self, response):
        request_body = '(cpgnum=1&layout=7.01-14_180_69_164_182&uiarea=2&carea=%24ROOT&fwrd=frwd0&cpgsize=12)/.do?page=2&_=1473841539272'
        yield Request(url=response.url, body=request_body, callback=self.parse_page)

    def parse_page(self, response):
        print(response.body)