
I want to run my spider from a script rather than via `scrapy crawl`.

I found this page

http://doc.scrapy.org/en/latest/topics/practices.html

but it doesn't actually say where to put that script.

Any help please?

Marco Dinatsoli
  • Presumably, you put that code in the script you want to use to run the spider. – Talvalin Feb 09 '14 at 18:05
  • @Talvalin I am asking where should I put the script? – Marco Dinatsoli Feb 09 '14 at 18:07
  • Providing your system path and PYTHONPATH are set up correctly, you should be able to put the script anywhere you like – Talvalin Feb 09 '14 at 18:25
  • @Talvalin so even my Scrapy project should be on PYTHONPATH? If yes, let's say I have 5 projects to scrape the domain `xxx.com`: which one will be fired if all the spiders have the same name but live in different projects? I actually have this case – Marco Dinatsoli Feb 09 '14 at 18:32
  • You don't need to create a project for a simple spider, you can use the command `scrapy runspider`. My answer contains a complete example. – Elias Dorneles Jun 23 '14 at 00:11
  • Cross-referencing [this answer](http://stackoverflow.com/a/27744766/771848) - should give you a detailed overview on how to run Scrapy from a script. – alecxe Jan 03 '15 at 01:46
  • Possible duplicate of [How to run Scrapy from within a Python script](http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script) – NoDataDumpNoContribution Aug 07 '16 at 20:21

4 Answers


It is simple and straightforward :)

Just check the official documentation. I would make one small change there, so that the spider runs only when you execute `python myscript.py` and not every time you merely import from it: just add an `if __name__ == "__main__":` guard:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'  # every spider needs a name
    # the rest of your spider definition (start_urls, parse, ...) goes here

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

Now save the file as `myscript.py` and run `python myscript.py`.

Enjoy!

Almog Cohen
  • I got your code working, but how do I save the output to a JSON file? If we run Scrapy from a command prompt or terminal, we can set **scrapy crawl MySpider -o output.json** and we get the output JSON file. If we use your code, where do I put the code to save the output to a JSON file? @AlmogCohen – syaifulhusein Aug 01 '18 at 16:14
  • @syaifulhusein Yup, it is more complicated. I can see that in one of my projects I did the following: I created a Scrapy pipeline that stores processed items in memory, and at the end of the spider run I took them out of the memory store and manipulated the items as I wished. If you're willing to open a new question I can answer there with a code example. – Almog Cohen Aug 07 '18 at 15:12
  • +Almog Cohen I think your code example is the solution to my question at https://stackoverflow.com/questions/52141729/get-scrapy-result-inside-a-django-view Could you answer there, please? – Paulo Fabrício Sep 03 '18 at 02:09
  • @PaulozOiOzuLLuFabrício you got it man. See my answer there. syaifulhusein you can go there as well if you still need an answer to your question. https://stackoverflow.com/questions/52141729/get-scrapy-result-inside-a-django-view – Almog Cohen Sep 04 '18 at 08:20
  • Very elegant solution – WJA Apr 04 '20 at 14:05
  • @syaifulhusein You can add the location of output by specifying custom settings for a spider, `settings = {"FEED_FORMAT": "json", "FEED_URI": "/tmp/verizon.json"} crawler_process = CrawlerProcess(settings=settings)` – learnToCode Jan 11 '21 at 05:41
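
Expanding that last comment into a minimal sketch (the output path is a placeholder, and `MySpider` stands for the spider class defined in this answer):

from scrapy.crawler import CrawlerProcess

# feed settings from the comment above: export scraped items as JSON
settings = {"FEED_FORMAT": "json", "FEED_URI": "output.json"}  # placeholder path

if __name__ == "__main__":
    process = CrawlerProcess(settings=settings)
    process.crawl(MySpider)  # the spider class from the answer above
    process.start()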

Luckily, the Scrapy source is open, so you can follow the way the `crawl` command works and do the same in your code:

...
# excerpt from the `crawl` command's run() method (older Scrapy versions)
crawler = self.crawler_process.create_crawler()
spider = crawler.spiders.create(spname, **opts.spargs)  # spname: spider name from the CLI args
crawler.crawl(spider)
self.crawler_process.start()
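
In recent Scrapy versions those internals have changed; a minimal sketch of the equivalent using the public API, assuming you run the script from inside a Scrapy project and `myspider` is a hypothetical spider name registered there:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project settings (Scrapy locates them via scrapy.cfg)
process = CrawlerProcess(get_project_settings())
process.crawl('myspider')  # hypothetical spider name; a spider class also works
process.start()  # blocks until crawling is finished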
Guy Gavriely

Why don't you just do this?

from scrapy import cmdline

cmdline.execute("scrapy crawl myspider".split())

Put that script in the same directory where you put `scrapy.cfg`.
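
As a variant, you can pass any other `scrapy crawl` arguments the same way, for example to export items (the spider name and output path here are placeholders):

from scrapy import cmdline

# equivalent to running: scrapy crawl myspider -o items.json
cmdline.execute(["scrapy", "crawl", "myspider", "-o", "items.json"])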

Aminah Nuraini

You can just create a normal Python script and then use Scrapy's `runspider` command, which allows you to run a spider without having to create a project.

For example, you can create a single file stackoverflow_spider.py with something like this:

import scrapy
import scrapy.contrib.loader  # needed for ItemLoader (see the comments below)

class QuestionItem(scrapy.item.Item):
    idx = scrapy.item.Field()
    title = scrapy.item.Field()

class StackoverflowSpider(scrapy.spider.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        questions = sel.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            l = scrapy.contrib.loader.ItemLoader(QuestionItem(), elem)
            l.add_value('idx', i)
            l.add_xpath('title', ".//h3/a/text()")
            yield l.load_item()

Then, provided you have scrapy properly installed, you can run it using:

scrapy runspider stackoverflow_spider.py -t json -o questions-items.json
Elias Dorneles
  • I get this error on the line `l = scrapy.contrib.loader.ItemLoader(QuestionItem(), elem)`: `exceptions.AttributeError: 'module' object has no attribute 'loader'`. Any idea how to solve that? – Basj Jan 02 '15 at 14:29
  • @Basj you need to `import scrapy.contrib.loader` for that to work. – Elias Dorneles Jan 02 '15 at 15:04
  • Thanks @elias. What should I add to your code here to be able to run it from `python stackoverflow_spider.py` instead of `scrapy runspider ...` ? I'm really stuck with this little question. – Basj Jan 02 '15 at 15:09
  • @Basj Due to the way Scrapy works on top of Twisted, that is not the recommended way to use it: usually you run spiders using scrapy command or deploy the project on an instance of [scrapyd](https://scrapyd.readthedocs.org). If you really want to run in a script, I can't add much you'd get from those docs -- I've never done it personally. – Elias Dorneles Jan 02 '15 at 16:02
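
For readers stuck on the same point as that last comment: a minimal, untested sketch that would let the spider above run via `python stackoverflow_spider.py`, reusing the `CrawlerProcess` approach from the first answer; the feed settings mirror the `-t json -o questions-items.json` flags:

# appended to stackoverflow_spider.py, below the spider definition
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess({
        'FEED_FORMAT': 'json',               # same as the -t json flag
        'FEED_URI': 'questions-items.json',  # same as the -o flag
    })
    process.crawl(StackoverflowSpider)
    process.start()  # blocks until the crawl finishes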