I actually built a scraper for something similar quite recently. That said, the best way to learn is through practice: dissect the code, then implement it yourself with your own amendments.
Here's an example scraper that uses a number of Scrapy's tools to get the information from the category page, the individual book pages, and the next-page links.
Some extra info is given in this link, where I used the code below as part of a question; perhaps that will also be useful to you.
I'd suggest using the scraper below to build your own, by also collecting info on the categories to the left of the page; a rough sketch of that follows.
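For example, here's a minimal sketch of an extra callback you could add to the spider to follow the sidebar category links. The side_categories class and the nested ul structure are assumptions based on the current markup of books.toscrape.com, so verify them against the live page:

    def parse_categories(self, response):
        # Hypothetical helper: follow every category link in the left sidebar
        # (the XPath is an assumption -- check it against the live HTML)
        links = response.xpath('//div[@class="side_categories"]//ul/li/ul/li/a/@href').getall()
        for href in links:
            yield response.follow(href, callback=self.parse)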
Quick breakdown of the scraper:
BooksItem:
- Here we build a field to store the results from the item loader, somewhat like a variable that stores a list.
BookSpider:
start_requests:
- Yield a request for each start URL (I usually store the start URLs in a list, to get into the habit of using multiple start URLs in my projects). The callback tells Scrapy where to parse the response.
parse:
- Get the XPath to the container, create a loop, and set up an item loader that references the fields in BooksItem. Find the XPaths of the items, and build another request with links to the book pages. We also get the link to the next page with the if-statement.
parse_book:
- Get some info from inside the book page, then yield the loaded item.
import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst


class BooksItem(scrapy.Item):
    items = Field(output_processor=TakeFirst())
    price = Field(output_processor=TakeFirst())
    availability = Field(output_processor=TakeFirst())


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
        'CONCURRENT_REQUESTS': 100,
        'CONCURRENT_REQUESTS_PER_IP': 100
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse
            )

    def parse(self, response):
        # Each <li> in this container is one book card on the listing page
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]//li')
        for books in data:
            loader = ItemLoader(BooksItem(), selector=books)
            loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price', './/p[@class="price_color"]//text()')

            # Follow the link to the book's own page, passing the
            # partially filled loader along via cb_kwargs
            for url in books.xpath('.//h3/a//@href').getall():
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_book,
                    cb_kwargs={'loader': loader}
                )

        # Paginate: keep following the "next" link until there isn't one
        next_page = response.xpath('.//a[normalize-space()="next"]//@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response, loader):
        # The availability text sits after the <i> icon inside the notice
        book_quote = response.xpath('//p[@class="instock availability"]/i/following-sibling::text()').get().strip()
        loader.add_value('availability', book_quote)
        yield loader.load_item()
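If you want to try it without setting up a full Scrapy project, one way (just a sketch, assuming everything above is saved in a single script) is to run the spider with CrawlerProcess:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(BookSpider)
process.start()  # blocks until the crawl finishes

Inside a proper project you'd run it with scrapy crawl books instead.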