I am scraping BBC Food for recipes. The logic is as follows:
Main page with about 20 cuisines
-> in each cuisine, there are usually ~20 recipes per letter, spread over 1-3 pages.
-> in each recipe, there are about six fields I scrape (ingredients, rating, etc.)
So my flow is: start at the main page, extract all cuisine links, follow each of them, extract every page of recipes from there, follow each recipe link, and finally collect all the data from each recipe. Note this is not finished yet, as I still need to make the spider go through all the letters.
I would love to have a 'category' column, i.e. for each recipe reached via the "african cuisine" link an entry saying "african", for each recipe from the "italian cuisine" link an "italian" entry, and so on.
Desired outcome:
cook_time  prep_time  name  cuisine
10         30         A     italian
20         10         B     italian
30         20         C     indian
20         10         D     indian
30         20         E     indian
Here is my spider:
import scrapy
from recipes_cuisines.items import RecipeItem

class ItalianSpider(scrapy.Spider):
    name = "italian_spider"

    def start_requests(self):
        start_urls = ['https://www.bbc.co.uk/food/cuisines']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_cuisines)

    def parse_cuisines(self, response):
        cuisine_cards = response.xpath('//a[contains(@class,"promo__cuisine")]/@href').extract()
        for url in cuisine_cards:
            yield response.follow(url=url, callback=self.parse_main)

    def parse_main(self, response):
        recipe_cards = response.xpath('//a[contains(@class,"main_course")]/@href').extract()
        for url in recipe_cards:
            yield response.follow(url=url, callback=self.parse_card)
        next_page = response.xpath('//div[@class="pagination gel-wrap"]/ul[@class="pagination__list"]/li[@class="pagination__list-item pagination__priority--0"]/a[@class="pagination__link gel-pica-bold"]/@href').get()
        if next_page is not None:
            next_page_url = response.urljoin(next_page)
            print(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_main)

    def parse_card(self, response):
        item = RecipeItem()
        # extract_first() gives a single string instead of a one-element list
        item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').extract_first()
        item['prep_time'] = response.xpath('//div[contains(@class,"recipe-metadata-wrap")]/p[@class="recipe-metadata__prep-time"]/text()').extract_first()
        item['cook_time'] = response.xpath('//p[contains(@class,"cook-time")]/text()').extract_first()
        item['servings'] = response.xpath('//p[contains(@class,"serving")]/text()').extract_first()
        # fixed: contains() takes two arguments, contains(@class,"..."), not @class="..."
        item['ratings_amount'] = response.xpath('//div[contains(@class,"aggregate-rating")]/span[contains(@class,"aggregate-rating__total")]/text()').extract()
        #item['ratings_amount'] = response.xpath('//*[@id="main-content"]/div[1]/div[4]/div/div[1]/div/div[1]/div[2]/div[1]/span[2]/text()').extract()
        item['ingredients'] = response.css('li.recipe-ingredients__list-item > a::text').extract()
        return item
and items:
import scrapy

class RecipeItem(scrapy.Item):
    name = scrapy.Field()
    prep_time = scrapy.Field()
    cook_time = scrapy.Field()
    servings = scrapy.Field()
    ratings_amount = scrapy.Field()
    rating = scrapy.Field()
    ingredients = scrapy.Field()
    cuisine = scrapy.Field()
Note that I am saving the output via
scrapy crawl italian_spider -o test.csv
I have read the documentation and tried several things, such as adding the extracted cuisine in the parse_cuisines or parse_main methods, but every attempt raised an error.