
I would like to run my Scrapy spider from a Python script. I can call my spider with the following code,

subprocess.check_output(['scrapy', 'crawl', 'mySpider'])

Until then, all is well. But I then instantiate my spider class, initializing start_urls in __init__, and the call to scrapy crawl stops working because it doesn't find the start_urls variable.

from flask import Flask, jsonify, request
import scrapy
import subprocess

class ClassSpider(scrapy.Spider):
    name        = 'mySpider'
    #start_urls = []
    #pages      = 0
    news        = []

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.start_urls = url

    def parse(self, response):
        ...

    def run(self):
        subprocess.check_output(['scrapy', 'crawl', 'mySpider'])
        return self.news

app = Flask(__name__)
data = []

@app.route('/', methods=['POST'])
def getNews():
    mySpiderClass = ClassSpider(request.json['url'], 2)

    data.append(mySpiderClass.run())
    return jsonify({'data': data})

if __name__ == "__main__":
    app.run(debug=True)

The error I get is: TypeError: __init__() missing 2 required positional arguments: 'url' and 'nbrPage'

Any help, please?

– Med ADDOU

2 Answers


Another way to start your spider from a script (and provide arguments):

from scrapy.crawler import CrawlerProcess
from path.to.your.spider import ClassSpider
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(
    ClassSpider,
    url=start_urls,           # you need to define it somewhere
    nbrPage=number_of_pages,  # you need to define it somewhere
)
process.start()
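
For example, a complete minimal runner could look like the sketch below (the URL and page count are hypothetical placeholders, and path.to.your.spider stands in for your real module path). The keyword names must match the parameters of ClassSpider.__init__:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from path.to.your.spider import ClassSpider

# Hypothetical values; replace with the real URL(s) and page count.
start_urls = ['https://example.com/news']
number_of_pages = 2

process = CrawlerProcess(get_project_settings())
process.crawl(ClassSpider, url=start_urls, nbrPage=number_of_pages)
process.start()  # blocks until the crawl finishes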
– gangabass
  • Thank you for your response. I tried it, but I still need to figure out how to define the parameters! I'll check that. – Med ADDOU Jun 08 '20 at 08:53

The reason you are getting this error message is that the command scrapy crawl mySpider starts the crawling process by creating a new instance of ClassSpider, and it does so without passing url and nbrPage.
It could work if you replaced subprocess.check_output(['scrapy', 'crawl', 'mySpider']) with subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.start_urls}', '-a', f'nbrPage={self.pages}']) (note that each spider argument needs its own -a flag). Also, you should make sure that start_urls is a list.
However, you would then still create two separate instances of the same spider, so I would suggest implementing run as a standalone function that takes url and nbrPage as arguments (see the sketch below).
There are also other methods of using Scrapy and Flask in the same script. For that purpose, check this question.
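
A minimal sketch of that suggestion (the name run_spider is illustrative, and it assumes the script runs from a directory where scrapy crawl mySpider works):

import subprocess

def run_spider(url, nbrPage):
    # Each spider argument needs its own -a flag; Scrapy forwards them
    # to ClassSpider.__init__ as keyword arguments. Note that -a values
    # always arrive as strings, so the spider must convert them itself.
    return subprocess.check_output(
        ['scrapy', 'crawl', 'mySpider',
         '-a', f'url={url}',
         '-a', f'nbrPage={nbrPage}']
    )

The Flask view would then call run_spider(request.json['url'], 2) instead of instantiating the spider itself.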

– Patrick Klein
  • Thank you for your response, but when I use the piece of code you recommended, I think the start_urls list is not supported. I added a print of self.start_urls just below def parse(self, response), but it doesn't display anything. – Med ADDOU Jun 07 '20 at 20:12
  • No problem, but this is not a recommendation; I would not do it this way. It's just why you get this error. Also, sorry, I totally do not understand what you mean by start_urls not being supported. It should always work unless you overwrite start_requests in a manner that does not support it. – Patrick Klein Jun 07 '20 at 21:31
  • Finally, it's working. Otherwise, how would you do it? – Med ADDOU Jun 08 '20 at 08:49
  • The link I posted has an answer further down that uses crochet. I am currently using this approach; with it, it's pretty easy for me to display the data from the spider on a page (a sketch of that approach follows these comments). There might be a better way; this one was just the easiest for me to implement and display the results. – Patrick Klein Jun 08 '20 at 09:22
  • Hi @Patrick K. Do you have any idea about this post please: https://stackoverflow.com/questions/62284110/how-do-i-fix-scrapy-unsupported-url-scheme-error?noredirect=1#comment110172315_62284110 – Med ADDOU Jun 10 '20 at 08:28
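
For reference, the crochet-based approach mentioned in the comments above usually looks roughly like the sketch below (ClassSpider is assumed importable; the function name scrape and the timeout are illustrative):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()  # must be called once, before Flask starts, to run Twisted's reactor

@wait_for(timeout=60.0)  # illustrative timeout
def scrape(url, nbrPage):
    runner = CrawlerRunner()
    # crawl() returns a Deferred; wait_for blocks the calling
    # Flask thread until the crawl finishes.
    return runner.crawl(ClassSpider, url=url, nbrPage=nbrPage)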