
I am new to Scrapy and Python and am having trouble understanding the flow, starting from this site: https://books.toscrape.com/

I am looking for how to gather all the categories of the different types of books using Scrapy, then determine how many books are listed in each category and export the result to JSON format.

Also, which software should I use, VS Code or Spyder?

  • You can use whatever IDE you prefer. Please add your code so we can help you better, and read [how to ask](https://stackoverflow.com/help/how-to-ask) – SuperUser Dec 23 '21 at 07:50

2 Answers


I actually built a scraper for something similar quite recently. However, the best way to learn is through practice: dissect the code and re-implement it yourself with your own amendments.

Here's an example scraper that uses several Scrapy tools to get information from the category page, the individual book pages, and the next pages.

Some extra info is given in this link, where I used the code below as part of a question. Perhaps that will also be useful for you.

I'd suggest using the scraper below as a base for building your own that collects info on the categories listed to the left of the page; a sketch of that idea follows the code.

Quick breakdown of the scraper:

BooksItem:

  • Here we define fields to store the results from the ItemLoader, somewhat like variables that store lists.

BookSpider:

  • We set up the start_urls.

start_requests:

  • Yield a request from the start URL (I usually store start URLs in a list, to get into the habit of using multiple start URLs in my projects). The callback tells Scrapy where to parse the response.

parse:

  • Get the XPath of the container, loop over it, and set up an ItemLoader that references the fields in BooksItem. Add the XPaths of the items, then build another request from the links to the individual book pages. The if-statement at the end picks up the next page.

parse_book:

  • Grab some info from inside the book page and yield the loaded item.

import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst


class BooksItem(scrapy.Item):
    # Each field keeps only the first value extracted by the ItemLoader.
    items = Field(output_processor=TakeFirst())
    price = Field(output_processor=TakeFirst())
    availability = Field(output_processor=TakeFirst())


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    # Note: the attribute must be named custom_settings (plural),
    # otherwise Scrapy silently ignores it.
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
        'CONCURRENT_REQUESTS': 100,
        'CONCURRENT_REQUESTS_PER_IP': 100
    }

    def start_requests(self):
        # Looping over a list makes it easy to add more start URLs later.
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse
            )

    def parse(self, response):
        # Each <li> in the main column is one product pod.
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]//li')
        for books in data:
            loader = ItemLoader(BooksItem(), selector=books)
            loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price', './/p[@class="price_color"]//text()')

            # Follow the link into the book's own page; pass the loader
            # along so parse_book can finish filling it in.
            for url in books.xpath('.//h3/a//@href').getall():
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_book,
                    cb_kwargs={'loader': loader}
                )

        # Keep paginating until there is no "next" link.
        next_page = response.xpath('.//a[normalize-space()="next"]//@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response, loader):
        # default='' guards against a missing availability node.
        book_quote = response.xpath('//p[@class="instock availability"]/i/following-sibling::text()').get(default='').strip()

        loader.add_value('availability', book_quote)
        yield loader.load_item()
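
To tie this back to the actual question (all categories, the number of books in each, exported to JSON), here is a minimal standalone sketch built on the same ideas. It assumes the category links sit in the sidebar (the div with class side_categories) and simply counts the article elements with class product_pod on each category page while following the pagination; double-check the XPaths against the live page before relying on them.

import scrapy


class CategoryCountSpider(scrapy.Spider):
    # Sketch for the original question: yield one item per category with
    # the number of books listed in it. The XPaths below are assumptions
    # based on the sidebar markup of books.toscrape.com.
    name = 'category_counts'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        # The nested <ul> in the sidebar holds one link per category;
        # this XPath skips the outer "Books" link above them.
        categories = response.xpath('//div[@class="side_categories"]//ul/li/ul/li/a')
        for cat in categories:
            yield response.follow(
                cat.xpath('@href').get(),
                callback=self.parse_category,
                cb_kwargs={
                    'category': cat.xpath('normalize-space(text())').get(),
                    'count': 0
                }
            )

    def parse_category(self, response, category, count):
        # Count the product pods on this page and carry the running
        # total through the pagination via cb_kwargs.
        count += len(response.xpath('//article[@class="product_pod"]'))
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            yield response.follow(
                next_page,
                callback=self.parse_category,
                cb_kwargs={'category': category, 'count': count}
            )
        else:
            yield {'category': category, 'count': count}

Running it with scrapy crawl category_counts -O category_counts.json writes the result straight to JSON (the -O overwrite flag needs a reasonably recent Scrapy; older versions only have -o, which appends). I believe each category page also shows an "N results" summary near the top, so parsing that number instead of counting pods would save you the pagination requests.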
– joe_bill.dollar

I am also quite new to web scraping and only recently started learning it. Here is my attempt at scraping the website you mentioned.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    rules = (
        # Follow the link of every product pod to its book page and
        # parse it with parse_item.
        Rule(LinkExtractor(
            restrict_xpaths="//li//article[@class='product_pod']//h3/a"), callback='parse_item', follow=True),

        # Follow the "next" link to paginate through the catalogue.
        Rule(LinkExtractor(
            restrict_xpaths="//li[@class='next']/a"), follow=True)
    )

    def parse_item(self, response):
        # Availability and review count sit in rows 6 and 7 of the
        # product information table on each book page.
        yield {
            'book_title': response.xpath("//article[@class='product_page']//h1/text()").get(),
            'price': response.xpath("//article[@class='product_page']//p[@class='price_color']/text()").get(),
            'availability': response.xpath("//table[@class='table table-striped']//tr[position() = 6]/td/text()").get(),
            'no_of_reviews': response.xpath("//table[@class='table table-striped']//tr[position() = 7]/td/text()").get()
        }
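
As for the JSON export part of the question: no extra code is needed, because Scrapy's feed exports can write every yielded item to a file. You can run scrapy crawl books -o books.json on the command line, or, on recent Scrapy versions (where the FEEDS setting and its overwrite key exist), declare the feed in the spider itself. A minimal sketch of that second option:

from scrapy.spiders import CrawlSpider


class BooksSpider(CrawlSpider):
    name = 'books'

    # Feed export: write every yielded item to books.json, equivalent
    # to running `scrapy crawl books -O books.json`.
    custom_settings = {
        'FEEDS': {
            'books.json': {'format': 'json', 'overwrite': True},
        },
    }

    # ... allowed_domains, start_urls, rules and parse_item
    # exactly as in the spider above ...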
– Jason