
I would like to run my Scrapy spider from a Python script. I can call my spider with the following code,

subprocess.check_output(['scrapy', 'crawl', 'mySpider'])

Until then, all is well. But I then instantiate my spider class, initializing start_urls in __init__, and the call to scrapy crawl stops working because it doesn't find the start_urls variable.

from flask import Flask, jsonify, request
import scrapy
import subprocess

class ClassSpider(scrapy.Spider):
    name        = 'mySpider'
    #start_urls = []
    #pages      = 0
    news        = []

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.start_urls = url

    def parse(self, response):
        ...

    def run(self):
        subprocess.check_output(['scrapy', 'crawl', 'mySpider'])
        return self.news

app = Flask(__name__)
data = []

@app.route('/', methods=['POST'])
def getNews():
    mySpiderClass = ClassSpider(request.json['url'], 2)

    data.append(mySpiderClass.run())
    return jsonify({'data': data})

if __name__ == "__main__":
    app.run(debug=True)

The error I get is: TypeError: __init__() missing 2 required positional arguments: 'url' and 'nbrPage'

Any help, please?

– Med ADDOU

2 Answers


Another way to start your spider from a script (and provide arguments):

from scrapy.crawler import CrawlerProcess
from path.to.your.spider import ClassSpider
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(
    ClassSpider,
    url=start_urls,           # you need to define it somewhere
    nbrPage=number_of_pages,  # you need to define it somewhere
)
process.start()
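
For example, a complete minimal runner could look like the sketch below (the URL and page count are hypothetical placeholders, and path.to.your.spider stands in for your real module path). The keyword names must match the parameters of ClassSpider.__init__:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from path.to.your.spider import ClassSpider

# Hypothetical values; replace with the real URL(s) and page count.
start_urls = ['https://example.com/news']
number_of_pages = 2

process = CrawlerProcess(get_project_settings())
process.crawl(ClassSpider, url=start_urls, nbrPage=number_of_pages)
process.start()  # blocks until the crawl finishes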
– gangabass
  • Thank you for your response. I tried it, but I still need to figure out how to define the parameters! I'll check that. – Med ADDOU Jun 08 '20 at 08:53

The reason you are getting this error message is that the command scrapy crawl mySpider starts the crawling process by creating a new instance of ClassSpider, and it does so without passing url and nbrPage.
It could work if you replaced subprocess.check_output(['scrapy', 'crawl', 'mySpider']) with subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.start_urls}', '-a', f'nbrPage={self.pages}']) (note that each spider argument needs its own -a flag). Also, you should make sure that start_urls is a list.
However, you would then still create two separate instances of the same spider, so I would suggest implementing run as a standalone function that takes url and nbrPage as arguments (see the sketch below).
There are also other methods of using Scrapy and Flask in the same script. For that purpose, check this question.
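
A minimal sketch of that suggestion (the name run_spider is illustrative, and it assumes the script runs from a directory where scrapy crawl mySpider works):

import subprocess

def run_spider(url, nbrPage):
    # Each spider argument needs its own -a flag; Scrapy forwards them
    # to ClassSpider.__init__ as keyword arguments. Note that -a values
    # always arrive as strings, so the spider must convert them itself.
    return subprocess.check_output(
        ['scrapy', 'crawl', 'mySpider',
         '-a', f'url={url}',
         '-a', f'nbrPage={nbrPage}']
    )

The Flask view would then call run_spider(request.json['url'], 2) instead of instantiating the spider itself.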

– Patrick Klein
  • Thank you for your response, but when I use the piece of code you recommended, I think the start_urls list is not supported. I added a print of self.start_urls just below def parse(self, response), but it doesn't display anything. – Med ADDOU Jun 07 '20 at 20:12
  • No problem, but this is not a recommendation; I would not do it this way. It's just why you get this error. Also, sorry, I totally do not understand what you mean by start_urls not being supported. It should always work unless you overwrite start_requests in a manner that does not support it. – Patrick Klein Jun 07 '20 at 21:31
  • Finally, it's working. Otherwise, how would you do it? – Med ADDOU Jun 08 '20 at 08:49
  • The link I posted has an answer further down that uses crochet. I am currently using this approach; with it, it's pretty easy for me to display the data from the spider on a page (a sketch of that approach follows these comments). There might be a better way; this one was just the easiest for me to implement and display the results. – Patrick Klein Jun 08 '20 at 09:22
  • Hi @Patrick K. Do you have any idea about this post please: https://stackoverflow.com/questions/62284110/how-do-i-fix-scrapy-unsupported-url-scheme-error?noredirect=1#comment110172315_62284110 – Med ADDOU Jun 10 '20 at 08:28
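
For reference, the crochet-based approach mentioned in the comments above usually looks roughly like the sketch below (ClassSpider is assumed importable; the function name scrape and the timeout are illustrative):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()  # must be called once, before Flask starts, to run Twisted's reactor

@wait_for(timeout=60.0)  # illustrative timeout
def scrape(url, nbrPage):
    runner = CrawlerRunner()
    # crawl() returns a Deferred; wait_for blocks the calling
    # Flask thread until the crawl finishes.
    return runner.crawl(ClassSpider, url=url, nbrPage=nbrPage)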