
My scraper works fine when I run it from the command line, but when I run it from within a Python script (using the Twisted-based method outlined here), it does not output the two CSV files that it normally does. I have a pipeline that creates and populates these files, one of them using CsvItemExporter() and the other using writeCsvFile(). Here is the code:

from os import getcwd

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter

# writeCsvFile is a project helper (assumed to live in spiders/myfuncs.py)
from SiteCrawler.spiders.myfuncs import writeCsvFile


class CsvExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # first CSV: one row per page, streamed through CsvItemExporter
        nodes = open('%s_nodes.csv' % spider.name, 'w+b')
        self.files[spider] = nodes
        self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url', 'name', 'screenshot'])
        self.exporter1.start_exporting()

        # second CSV: edge list accumulated in memory, written on close
        self.edges = []
        self.edges.append(['Source', 'Target', 'Type', 'ID', 'Label', 'Weight'])
        self.num = 1

    def spider_closed(self, spider):
        self.exporter1.finish_exporting()
        file = self.files.pop(spider)
        file.close()

        writeCsvFile(getcwd() + r'\edges.csv', self.edges)

    def process_item(self, item, spider):
        self.exporter1.export_item(item)

        for url in item['links']:
            self.edges.append([item['url'], url, 'Directed', self.num, '', 1])
            self.num += 1
        return item
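The pipeline is enabled in SiteCrawler/settings.py with an entry along these lines (the exact entry is assumed; Scrapy 0.x uses the list form), which is why it runs normally from the command line:

# SiteCrawler/settings.py -- relevant entry (assumed)
ITEM_PIPELINES = [
    'SiteCrawler.pipelines.CsvExportPipeline',
]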

Here is my file structure:

SiteCrawler/      # the CSVs are normally created in this folder
    runspider.py  # this is the script that runs the scraper
    scrapy.cfg
    SiteCrawler/
        __init__.py
        items.py
        pipelines.py
        screenshooter.py
        settings.py
        spiders/
            __init__.py
            myfuncs.py
            sitecrawler_spider.py

The scraper appears to function normally in all other ways. The command-line output at the end suggests that the expected number of pages were crawled, and the spider appears to have finished normally. I am not getting any error messages.

---- EDIT: ----

Inserting print statements and deliberate syntax errors into the pipeline has no effect, so it appears that the pipeline is being ignored. Why might this be?

Here is the code for the script that runs the scraper (runspider.py):

from twisted.internet import reactor

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
import logging

from SiteCrawler.spiders.sitecrawler_spider import MySpider

def stop_reactor():
    reactor.stop()

# stop the Twisted reactor once the spider has finished
dispatcher.connect(stop_reactor, signal=signals.spider_closed)

spider = MySpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()

log.start(loglevel=logging.DEBUG)
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
  • Could the files be written somewhere else? Can you check your output file paths or use absolute file paths? – paul trmbrth Jul 20 '13 at 10:42
  • @pault. Good point. I have now tried it using os.path.dirname(__file__), getcwd() and the exact file path typed in. Unfortunately, these haven't made any difference. – Joe_AK Jul 20 '13 at 10:59
  • I've tried adding print statements to show what getcwd() and os.path.dirname(__file__) output, but they don't seem to execute. Does that mean the pipeline is being ignored? Or is running this inside the reactor interfering with my printing? – Joe_AK Jul 20 '13 at 11:02
  • OK - I just added a horrific syntax error to my pipeline code and it had no effect, so it seems the pipeline is being ignored. Any idea why this might be? – Joe_AK Jul 20 '13 at 11:07
  • I guess it has to do with which settings are actually used. What does the log say at the beginning? You should see all enabled middlewares and pipelines listed. – paul trmbrth Jul 20 '13 at 11:09
  • @pault. Yes! You're right, it has none of those things: no extensions enabled, no middleware enabled, no pipelines enabled. Do you know what I would have to do to correct this? – Joe_AK Jul 20 '13 at 11:15
  • I'm looking at http://doc.scrapy.org/en/latest/topics/api.html#scrapy.settings.Settings and https://github.com/scrapy/scrapy/blob/master/scrapy/settings/__init__.py. Maybe you have to use `CrawlerSettings(settings.module.to.use)`. At least you should be able to check in your runspider.py by separating out `mysettings = CrawlerSettings(settings.module.to.use)`, maybe printing some values from these settings with `mysettings.get(setting_name)`, and then `crawler = Crawler(mysettings)...` – paul trmbrth Jul 20 '13 at 11:21
  • Might be exactly that: https://groups.google.com/forum/#!topic/scrapy-users/LrXYwy-0qRE, see https://groups.google.com/d/msg/scrapy-users/LrXYwy-0qRE/V3emUREJplQJ – paul trmbrth Jul 20 '13 at 11:25
  • Fantastic! It works! I replaced "from scrapy.settings import Settings" with "from scrapy.utils.project import get_project_settings as Settings" and it works perfectly now. Thank you! – Joe_AK Jul 20 '13 at 11:39
  • Great! I may need that in the future too. You can post your own answer with how you resolved it. – paul trmbrth Jul 20 '13 at 11:43
  • Cross-referencing [this answer](http://stackoverflow.com/a/27744766/771848) - should give you a detailed overview on how to run Scrapy from a script. – alecxe Jan 03 '15 at 01:40
  • @jkdune: since you found your own answer, please post it and accept it so this question can be closed, it's been 18 months. – smci Jan 19 '15 at 02:22

2 Answers


Replacing "from scrapy.settings import Settings" with "from scrapy.utils.project import get_project_settings as Settings" fixed the problem.

The solution was found here, though no explanation was given there.

alecxe has provided an example of how to run Scrapy from inside a Python script.

EDIT:

Having read through alecxe's post in more detail, I can now see the difference between "from scrapy.settings import Settings" and "from scrapy.utils.project import get_project_settings as Settings". The latter allows you to use your project's settings file, as opposed to a default settings file. Read alecxe's post (linked above) for more detail.
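For reference, here is a minimal sketch of the corrected runspider.py (same Scrapy 0.x API as in the question; only the settings import changes):

from twisted.internet import reactor
import logging

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings  # replaces scrapy.settings.Settings
from scrapy.xlib.pydispatch import dispatcher

from SiteCrawler.spiders.sitecrawler_spider import MySpider

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)

# get_project_settings() reads SiteCrawler/settings.py, so ITEM_PIPELINES
# (and with it the CSV pipeline) is actually loaded
crawler = Crawler(get_project_settings())
crawler.configure()
crawler.crawl(MySpider())
crawler.start()

log.start(loglevel=logging.DEBUG)
reactor.run()  # blocks until the spider closes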


In my project, I call the Scrapy code from another Python script using os.system:

import os

os.chdir('/home/admin/source/scrapy_test')

# run the spider exactly as from the shell, overriding the feed and
# log locations with -s settings
command = "scrapy crawl test_spider -s FEED_URI='file:///home/admin/scrapy/data.csv' -s LOG_FILE='/home/admin/scrapy/scrapy_test.log'"
return_code = os.system(command)  # 0 means the command exited cleanly
print('done')
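For comparison, the same call can be made with the subprocess module (a sketch, not part of the original answer), which avoids shell quoting and surfaces the exit code directly:

import subprocess

# same crawl as above, but with an argument list instead of a shell string
result = subprocess.call(
    ['scrapy', 'crawl', 'test_spider',
     '-s', 'FEED_URI=file:///home/admin/scrapy/data.csv',
     '-s', 'LOG_FILE=/home/admin/scrapy/scrapy_test.log'],
    cwd='/home/admin/source/scrapy_test',
)
print('done' if result == 0 else 'scrapy exited with code %d' % result)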