I am scraping BBC Food for recipes. The logic is as follows:
Main page with about 20 cuisines
-> in each cuisine, there are usually ~20 recipes per letter, spread over 1-3 pages.
-> in each recipe, there are about six fields I scrape (ingredients, rating, etc.)
So my flow is: start at the main page, extract all cuisine links, follow each of them, extract every page of recipes from there, follow each recipe link, and finally collect all the data from each recipe. Note this is not finished yet, as I still need to make the spider go through all the letters.
I would love to have a 'category' column, i.e. for each recipe reached via the "african cuisine" link an entry saying "african", for each recipe from the "italian cuisine" link an "italian" entry, and so on.
Desired outcome:
cook_time  prep_time  name  cuisine
10         30         A     italian
20         10         B     italian
30         20         C     indian
20         10         D     indian
30         20         E     indian
Here is my spider:
import scrapy
from recipes_cuisines.items import RecipeItem

class ItalianSpider(scrapy.Spider):
    name = "italian_spider"

    def start_requests(self):
        start_urls = ['https://www.bbc.co.uk/food/cuisines']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_cuisines)

    def parse_cuisines(self, response):
        cuisine_cards = response.xpath('//a[contains(@class,"promo__cuisine")]/@href').extract()
        for url in cuisine_cards:
            yield response.follow(url=url, callback=self.parse_main)

    def parse_main(self, response):
        recipe_cards = response.xpath('//a[contains(@class,"main_course")]/@href').extract()
        for url in recipe_cards:
            yield response.follow(url=url, callback=self.parse_card)
        next_page = response.xpath('//div[@class="pagination gel-wrap"]/ul[@class="pagination__list"]/li[@class="pagination__list-item pagination__priority--0"]/a[@class="pagination__link gel-pica-bold"]/@href').get()
        if next_page is not None:
            next_page_url = response.urljoin(next_page)
            print(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_main)

    def parse_card(self, response):
        item = RecipeItem()
        # extract_first() gives a single string instead of a one-element list
        item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').extract_first()
        item['prep_time'] = response.xpath('//div[contains(@class,"recipe-metadata-wrap")]/p[@class="recipe-metadata__prep-time"]/text()').extract_first()
        item['cook_time'] = response.xpath('//p[contains(@class,"cook-time")]/text()').extract_first()
        item['servings'] = response.xpath('//p[contains(@class,"serving")]/text()').extract_first()
        # fixed: contains() takes two arguments, contains(@class,"..."), not @class="..."
        item['ratings_amount'] = response.xpath('//div[contains(@class,"aggregate-rating")]/span[contains(@class,"aggregate-rating__total")]/text()').extract()
        #item['ratings_amount'] = response.xpath('//*[@id="main-content"]/div[1]/div[4]/div/div[1]/div/div[1]/div[2]/div[1]/span[2]/text()').extract()
        item['ingredients'] = response.css('li.recipe-ingredients__list-item > a::text').extract()
        return item
and items:
import scrapy

class RecipeItem(scrapy.Item):
    name = scrapy.Field()
    prep_time = scrapy.Field()
    cook_time = scrapy.Field()
    servings = scrapy.Field()
    ratings_amount = scrapy.Field()
    rating = scrapy.Field()
    ingredients = scrapy.Field()
    cuisine = scrapy.Field()
Note that I am saving the output via
scrapy crawl italian_spider -o test.csv
I have read the documentation and tried several things, such as adding the extracted cuisine in the parse_cuisines or parse_main methods, but every attempt raised an error.