In the FEED_URI setting you can use printf-style placeholders that are replaced at export time; besides built-in ones like %(time)s, any named placeholder is filled from the spider attribute of the same name. For example, a domain name can be included in the file name through a domain spider attribute like this:
FEED_URI = 's3://my-bucket/%(domain)s/%(time)s.json'
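For that placeholder to resolve, the spider needs a domain attribute. A minimal sketch (the spider name and the way the attribute is set are my assumptions, not part of your setup) would pass it in as a -a command-line argument:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, domain=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # %(domain)s in FEED_URI is filled from this spider attribute.
        self.domain = domain
        self.start_urls = [f"https://{domain}/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

You would then run it once per domain, e.g. scrapy crawl quotes -a domain=toscrape.com.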
This solution only works if the spider is run once per domain, though, and since you haven't explicitly said so, I'll assume a single run crawls multiple domains.
If you know all the domains beforehand, you can generate the value of the FEEDS setting programmatically and use item filtering through the item_filter feed option.
# Assumes that items have a "domain" field and that all target domains are
# listed in the ALL_DOMAINS variable.
class DomainFilter:
    def __init__(self, feed_options):
        # Each feed passes its target domain through a custom feed option.
        self.domain = feed_options["domain"]

    def accepts(self, item):
        return item["domain"] == self.domain

ALL_DOMAINS = ["toscrape.com", ...]

FEEDS = {
    f"s3://mybucket/{domain}.jsonl": {
        "format": "jsonlines",
        "item_filter": DomainFilter,
        "domain": domain,  # read back by DomainFilter.__init__
    }
    for domain in ALL_DOMAINS
}
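For completeness, here's a rough sketch of one way the "domain" field that DomainFilter checks could be populated; the spider and field names are assumptions, adapt them to your items:

from urllib.parse import urlparse

import scrapy

class MultiDomainSpider(scrapy.Spider):
    name = "multi"
    start_urls = [f"https://{domain}/" for domain in ("toscrape.com",)]

    def parse(self, response):
        # Derive the item's domain from the response URL; strip "www." so it
        # matches the ALL_DOMAINS entries used as feed keys.
        # (str.removeprefix needs Python 3.9+.)
        domain = urlparse(response.url).netloc.removeprefix("www.")
        yield {"domain": domain, "url": response.url}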