
Based on suggestions here, I'm trying:

scrapy crawl spider-name -a start_urls="https://start-url.com/"

I get:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/scrapy/core/engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python3.9/site-packages/scrapy/spiders/__init__.py", line 77, in start_requests
    yield Request(url, dont_filter=True)
  File "/usr/local/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 73, in _set_url
    raise ValueError(f'Missing scheme in request url: {self._url}')

To reproduce, run the following:

scrapy startproject example_project
cd example_project
scrapy genspider spider1 https://stackoverflow.com
scrapy crawl spider1 -a start_urls="https://stackoverflow.com"

1 Answer


The command scrapy genspider generates this code:

import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['https://stackoverflow.com']
    start_urls = ['https://stackoverflow.com/']

    def parse(self, response):
        pass

This does not handle start_urls as a command-line parameter. To make it do so, follow the guide you linked. Something like

import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['https://stackoverflow.com']

    def __init__(self, *args, start_urls=None, **kwargs):
        super().__init__(*args, **kwargs)
        # -a start_urls="url1,url2" arrives as a single string; split it into a list
        self.start_urls = start_urls.split(',') if start_urls else []

    def parse(self, response):
        pass

will work.
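With that change in place, one or more comma-separated URLs can be passed on the command line (the second URL below is just a placeholder for illustration):

scrapy crawl spider1 -a start_urls="https://stackoverflow.com/,https://example.com/"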

Note: start_urls must be a list. If it is left as the raw string that -a passes in, Scrapy will complain; that is exactly the error shown in the question.
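The traceback shows why: the default start_requests iterates over start_urls, so a bare string is consumed one character at a time, and each character fails the URL scheme check. A simplified sketch of the default behaviour:

def start_requests(self):
    # If start_urls is the string "https://start-url.com/", this loop
    # yields single characters ('h', 't', ...), and each one raises
    # ValueError: Missing scheme in request url: h
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)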

Ivs
  • 45
  • 1
  • 9
  • I assumed this could be handled implicitly, without explicitly defining `self.start_urls = ...`, or maybe there is some way that already exists for achieving the same result. – watch-this Sep 08 '21 at 11:21
  • It isn't handled implicitly, no. See the docs for `scrapy.Spider` and `start_urls` here: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy-spider . Specifying the urls explicitly is the only way. – Ivs Sep 08 '21 at 11:24
  • Since there's no difference between this answer and the ones [here](https://stackoverflow.com/questions/9681114/how-to-give-url-to-scrapy-for-crawling) this can possibly be considered a duplicate, so I may close the question. – watch-this Sep 08 '21 at 11:30