
Following the documentation, I can run Scrapy from a Python script, but I can't get the result of the crawl back.

This is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from items import DmozItem

class DmozSpider(BaseSpider):
    name = "douban"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/group/xxx/discussion"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # every link in a title cell of the discussion table
        rows = hxs.select("//table[@class='olt']/tr/td[@class='title']/a")
        items = []
        for row in rows:
            item = DmozItem()
            item["title"] = row.select('text()').extract()[0]
            item["link"] = row.select('@href').extract()[0]
            items.append(item)

        return items

Notice the last line: I try to use the returned parse result. If I run:

 scrapy crawl douban

the terminal prints the returned items.

But I can't get the returned items from my Python script. Here is the script:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from spiders.dmoz_spider import DmozSpider
from scrapy.xlib.pydispatch import dispatcher

def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg("------------>Running reactor")
result = reactor.run()
print result
log.msg("------------>Running stoped")

I try to get the result from reactor.run(), but it returns nothing.

How can I get the result?

  • where do you put the script please? in the scrapy project, or in the spider folder, or what? – William Kinaan Feb 09 '14 at 18:19
  • Cross-referencing [this answer](http://stackoverflow.com/a/27744766/771848) - should give you a detailed overview on how to run Scrapy from a script. – alecxe Jan 03 '15 at 01:40

3 Answers


The terminal prints the result because the default log level is DEBUG.

When you run your spider from the script and call log.start(), the default log level is set to INFO.

Just replace:

log.start()

with

log.start(loglevel=log.DEBUG)

UPD:

To get the result as a string, you can log everything to a file and then read it back, e.g.:

log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)

reactor.run()

with open("results.log", "r") as f:
    result = f.read()
print result

Hope that helps.
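Parsing the items back out of the log file is left implicit above. One possible sketch, assuming the DEBUG-level log prints each scraped item as a Python dict literal on the line after a "Scraped from" message (check your actual log format before relying on this):

```python
import ast

def items_from_log(text):
    """Pull item dicts back out of a DEBUG-level Scrapy log.

    Assumes each scraped item is printed as a dict literal
    on the line following a 'Scraped from' message.
    """
    items = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if "Scraped from" in line and i + 1 < len(lines):
            candidate = lines[i + 1].strip()
            if candidate.startswith("{"):
                try:
                    items.append(ast.literal_eval(candidate))
                except (SyntaxError, ValueError):
                    pass  # not a clean dict literal; skip it
    return items

sample = (
    "2013-07-10 10:00:00 [douban] DEBUG: Scraped from <200 http://example.com>\n"
    "{'title': 'First post', 'link': '/group/xxx/topic/1'}\n"
    "2013-07-10 10:00:01 [douban] INFO: Closing spider (finished)\n"
)
print(items_from_log(sample))  # one dict recovered from the sample log
```

This is brittle (it depends on the log's exact layout), which is why collecting items via signals or pipelines, as the other answers suggest, is usually preferable.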

  • Thank you, this solves one of the problems, but how can I get the returned result from the spider in my script? What's wrong with `result = reactor.run()`? – hh54188 Jul 10 '13 at 07:18
  • You are welcome. I've updated the answer and added an option for getting the results as a string. – alecxe Jul 10 '13 at 07:36
  • 2
    That permit show the output but instead of collect the data from there I think the proper way to do this is write or use pipelines. – Igor Medeiros Sep 25 '13 at 22:15
  • Please, where should I put the script? I want to run the spider from the script, and I can do that as the question says, but I don't know where to put the script. Should I put it directly in the project folder, or in the spider folder? Thanks in advance. – William Kinaan Feb 09 '14 at 18:22

I found your question while asking myself the same thing, namely: "How can I get the result?". Since this wasn't answered here, I endeavoured to find the answer myself, and now that I have, I can share it:

items = []
def add_item(item):
    items.append(item)
dispatcher.connect(add_item, signal=signals.item_passed)

Or, for Scrapy 0.22 (http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script), replace the last line of my solution with:

crawler.signals.connect(add_item, signals.item_passed)

My solution is freely adapted from http://www.tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/.
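The pattern here — register a callback for the item signal, let every scraped item fire it, and collect the items in a list — can be illustrated without Scrapy at all. Below, `FakeDispatcher` and `item_passed` are stand-ins for Scrapy's real dispatcher and `signals.item_passed`, just to show the mechanism in isolation:

```python
# Stand-ins for Scrapy's dispatcher and signals.item_passed,
# to show the collection pattern in isolation (no Scrapy required).
class FakeDispatcher:
    def __init__(self):
        self._receivers = {}

    def connect(self, receiver, signal):
        # register a callback for a given signal
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        # fire the signal, invoking every registered callback
        for receiver in self._receivers.get(signal, []):
            receiver(**kwargs)

item_passed = "item_passed"   # stand-in for scrapy's signals.item_passed
dispatcher = FakeDispatcher()

items = []
def add_item(item):
    items.append(item)

dispatcher.connect(add_item, signal=item_passed)

# The crawl engine would do this once per scraped item:
dispatcher.send(item_passed, item={"title": "First post", "link": "/topic/1"})
dispatcher.send(item_passed, item={"title": "Second post", "link": "/topic/2"})

print(items)  # both items, collected after the "crawl"
```

In the real script, the only difference is that Scrapy's engine sends the signal for you; `items` is populated by the time the reactor stops.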

  • This works perfectly for me; however, since Scrapy 1.1 the line `from scrapy.xlib.pydispatch import dispatcher` gives a ScrapyDeprecationWarning: importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals, use the `from_crawler` class method; otherwise import pydispatch directly if needed. See https://github.com/scrapy/scrapy/issues/1762 – Derwent Oct 09 '16 at 05:31

In my case, I placed the script file at the Scrapy project level, e.g. if the spiders live in scrapyproject/scrapyproject/spiders, then I placed it at scrapyproject/myscript.py.
