
Here is my spider:

import scrapy


class PhonesCDSpider(scrapy.Spider):
    name = "phones_CD"

    custom_settings = {
        "FEEDS": {
            "Spiders/spiders/cd.json": {"format": "json"},
        },
    }

    start_urls = [
        'https://www.cdiscount.com/telephonie/telephone-mobile/smartphones/tous-nos-smartphones/l-144040211.html'
    ]

    def parse(self, response):
        for phone in response.css('div.prdtBlocInline.jsPrdtBlocInline'):
            phone_url = phone.css('div.prdtBlocInline.jsPrdtBlocInline a::attr(href)').get()

            # go to the phone page
            yield response.follow(phone_url, callback=self.parse_phone)


    def parse_phone(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('span.fpPrice.price.jsMainPrice.jsProductPrice.hideFromPro::attr(content)').get(),
            'EAN': response.css('script').getall(),
            'image_url': response.css('div.fpMainImg a::attr(href)').get(),
            'url': response.url,
        }

If I start it from the terminal with `scrapy crawl phones_CD -O test.json`, it works fine. But if I run it in my Python script (where the other crawlers work and are configured the same way):

    from scrapy.crawler import CrawlerProcess

    # PhonesCBSpider, PhonesKFSpider, PhonesMMSpider and PhonesCDSpider
    # are defined (or imported) elsewhere in this script.
    def all_crawlers():
        process = CrawlerProcess()
        process.crawl(PhonesCBSpider)
        process.crawl(PhonesKFSpider)
        process.crawl(PhonesMMSpider)
        process.crawl(PhonesCDSpider)
        process.start()

    all_crawlers()

I get an error; here is the traceback:

2021-01-05 18:16:06 [scrapy.core.engine] INFO: Spider opened
2021-01-05 18:16:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-01-05 18:16:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6026
2021-01-05 18:16:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/telephonie/telephone-mobile/smartphones/tous-nos-smartphones/l-144040211.html> (referer: None)
2021-01-05 18:16:07 [scrapy.core.engine] INFO: Closing spider (finished)

Thanks in advance for your time!

lundsonn
  • This may help you: https://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script – Samsul Islam Jan 05 '21 at 17:25
  • I forgot to say that I have other crawlers that are working fine in the same Python file, though. – lundsonn Jan 05 '21 at 17:28
  • @lundsonn Are you sure that it is an error? If you launch 4 spiders in a single crawler process, it is expected that you will get 4 `Spider opened` messages and 4 `Closing spider` messages, and the other info messages will be duplicated for each spider as well. – Georgiy Jan 06 '21 at 08:47
  • A [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) could get you more feedback, or even help you figure out the issue for yourself. – Gallaecio Feb 22 '21 at 03:49

1 Answer


According to the Scrapy feed-exports documentation, the `FEEDS` setting does not support relative paths like your `"Spiders/spiders/cd.json"`.
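If relative paths are indeed the problem, a minimal sketch of a workaround is to build an absolute feed path with `pathlib` (the file name `cd.json` comes from the question; resolving it against the spider module's `__file__` is an assumption about your project layout):

    from pathlib import Path

    import scrapy


    class PhonesCDSpider(scrapy.Spider):
        name = "phones_CD"

        custom_settings = {
            "FEEDS": {
                # Resolve the output file against this module's directory so
                # the path no longer depends on the current working directory.
                str(Path(__file__).resolve().parent / "cd.json"): {"format": "json"},
            },
        }

This way the feed location stays the same whether the spider is started with `scrapy crawl` or from your script.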

Georgiy
  • My other spiders have a relative path for their feeds as well and work perfectly, so I don't think that's the issue. Moreover, the file "cd.json" is created but is just empty, as my scraper stops. – lundsonn Jan 05 '21 at 19:21