
Question

I want to know how to disable Item storing in scrapyd.

What I tried

I deploy a spider to the Scrapy daemon scrapyd. The deployed spider stores the scraped data in a database, and this works fine.

However, scrapyd also stores each scraped item. You can see this in the scrapyd web interface. The item data is stored in ..../items/<project name>/<spider name>/<job name>.jl

I have no clue how to disable this. I run scrapyd in a Docker container and it uses way too much storage.

I have tried the approach from Suppress Scrapy Item printed in logs after pipeline, but it seems this does nothing for scrapyd. All spider logging settings seem to be ignored by scrapyd.

Edit: I found this entry in the documentation about item storing. It seems that if you leave out the items_dir setting, items will not be stored, and it is said that this is disabled by default. I do not have a scrapyd.conf file, so item storing should be disabled, but it is not.
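For reference, this is roughly what an explicit scrapyd.conf that disables item storing would look like (a sketch based on the configuration docs; as said, I don't actually have this file):

[scrapyd]
# an empty items_dir means scraped items are not written to .jl files;
# note that scrapyd also reads system-wide config (e.g. /etc/scrapyd/scrapyd.conf),
# which may set items_dir even when the project itself has no scrapyd.conf
items_dir =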

Pullie

1 Answer


After writing my answer I re-read your question, and I see that what you want has nothing to do with logging; it's about not writing to the (default-ish) .jl feed (maybe update the title to: "Disable scrapyd Item storing"). To override scrapyd's default, just set FEED_URI to an empty string like this:

$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=

For other people who are looking into logging... Let's see an example. We do the usual:

$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com

then edit tutorial/spiders/example.py to contain the following:

import scrapy

class TutorialItem(scrapy.Item):
    name = scrapy.Field()
    surname = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = "example"

    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # yield 100 dummy items so we can see how they show up in the logs
        for i in range(100):
            t = TutorialItem()
            t['name'] = "foo"
            t['surname'] = "bar %d" % i
            yield t

Notice the difference between running:

$ scrapy crawl example
# or
$ scrapy crawl example -L DEBUG
# or
$ scrapy crawl example -s LOG_LEVEL=DEBUG

and

$ scrapy crawl example -s LOG_LEVEL=INFO
# or
$ scrapy crawl example -L INFO

By trying such combinations on your spider, you can confirm that it doesn't print item info at any log level above DEBUG.

It's now time, after you deploy to scrapyd, to do exactly the same:

$ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example

Confirm that the logs don't contain items when you check the job's log in the scrapyd web interface.
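And if you want both at once, no item feed and quieter logs, you should be able to pass the setting parameter twice in the same schedule call (same tutorial/example names as above):

$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI= -d setting=LOG_LEVEL=INFO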

Note that if your items are still printed at the INFO level, it likely means that your code or some pipeline is printing them. You could raise the log level further and/or investigate, find the code that prints them, and remove it.
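If a pipeline turns out to be the culprit, the fix is simply to store without printing. Here is a rough sketch of such a pipeline for the TutorialItem above; the pymysql driver, the table and the connection details are placeholders I made up, not something from your setup:

import pymysql  # placeholder driver choice; any DB-API driver works the same way

class QuietMySQLPipeline(object):

    def open_spider(self, spider):
        # connection details are made-up placeholders
        self.conn = pymysql.connect(host='db', user='scrapy',
                                    password='secret', db='tutorial')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            cursor.execute(
                "INSERT INTO people (name, surname) VALUES (%s, %s)",
                (item['name'], item['surname']))
        # deliberately no print(item) here: a bare print bypasses LOG_LEVEL
        # and would put item data straight back into the job log
        return item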

neverlastn
  • I hoped your suggestion would work, but it doesn't. I tried LOG_LEVEL=WARN and it does not make any difference. How is this setting different from the log level in the settings file? – Pullie Apr 24 '16 at 20:45
  • Well done for using docker, by the way. I don't know exactly how you use the scraped data, but in general you get lower overhead if you e.g. bulk upload your `Item`s after crawling completes, instead of doing it while you crawl in an item pipeline. That is the concept behind those `.jl` files: you hook into the `spider_closed` signal and use a bulk tool to upload the `.jl` file. There are also rare cases where this isn't recommended or possible, such as when you want minimum latency. – neverlastn Apr 25 '16 at 01:22
  • In the pipeline I save directly to a MySQL database running in another Docker container. I override the process_item method: `def process_item(self, productitem, spider):` – Pullie Apr 25 '16 at 05:54
  • Great answer! It explains how to disable both item storing and item logging. Your solution works.