
Question

I want to know how to disable Item storing in scrapyd.

What I tried

I deploy a spider to the Scrapy daemon scrapyd. The deployed spider stores the scraped data in a database, and this works fine.

However, scrapyd also stores each scraped item. You can see this in the scrapyd web interface. The item data is stored in ..../items/<project name>/<spider name>/<job name>.jl

I have no clue how to disable this. I run scrapyd in a Docker container and it uses way too much storage.

I have tried the approach from Suppress Scrapy Item printed in logs after pipeline, but it seems this does nothing for scrapyd. All spider logging settings seem to be ignored by scrapyd.

Edit: I found this entry in the documentation about item storing. It seems that if you leave out the items_dir setting, items will not be stored, and it is said that this is disabled by default. I do not have a scrapyd.conf file, so item storing should be disabled, but it is not.
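For reference, this is roughly what an explicit scrapyd.conf that disables item storing would look like (a sketch based on the configuration docs; as said, I don't actually have this file):

[scrapyd]
# an empty items_dir means scraped items are not written to .jl files;
# note that scrapyd also reads system-wide config (e.g. /etc/scrapyd/scrapyd.conf),
# which may set items_dir even when the project itself has no scrapyd.conf
items_dir =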

Pullie

1 Answer


After writing my answer I re-read your question, and I see that what you want has nothing to do with logging; it's about not writing to the (default-ish) .jl feed (maybe update the title to: "Disable scrapyd Item storing"). To override scrapyd's default, just set FEED_URI to an empty string like this:

$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=

For other people who are looking into logging... Let's see an example. We do the usual:

$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com

then edit tutorial/spiders/example.py to contain the following:

import scrapy

class TutorialItem(scrapy.Item):
    name = scrapy.Field()
    surname = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = "example"

    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # yield 100 dummy items so we can see how they show up in the logs
        for i in range(100):
            t = TutorialItem()
            t['name'] = "foo"
            t['surname'] = "bar %d" % i
            yield t

Notice the difference between running:

$ scrapy crawl example
# or
$ scrapy crawl example -L DEBUG
# or
$ scrapy crawl example -s LOG_LEVEL=DEBUG

and

$ scrapy crawl example -s LOG_LEVEL=INFO
# or
$ scrapy crawl example -L INFO

By trying such combinations on your spider, you can confirm that it doesn't print item info at any log level above DEBUG.

It's now time, after you deploy to scrapyd, to do exactly the same:

$ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example

Confirm that the logs don't contain items when you check the job's log in the scrapyd web interface.
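And if you want both at once, no item feed and quieter logs, you should be able to pass the setting parameter twice in the same schedule call (same tutorial/example names as above):

$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI= -d setting=LOG_LEVEL=INFO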

Note that if your items are still printed at the INFO level, it likely means that your code or some pipeline is printing them. You could raise the log level further and/or investigate, find the code that prints them, and remove it.
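If a pipeline turns out to be the culprit, the fix is simply to store without printing. Here is a rough sketch of such a pipeline for the TutorialItem above; the pymysql driver, the table and the connection details are placeholders I made up, not something from your setup:

import pymysql  # placeholder driver choice; any DB-API driver works the same way

class QuietMySQLPipeline(object):

    def open_spider(self, spider):
        # connection details are made-up placeholders
        self.conn = pymysql.connect(host='db', user='scrapy',
                                    password='secret', db='tutorial')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            cursor.execute(
                "INSERT INTO people (name, surname) VALUES (%s, %s)",
                (item['name'], item['surname']))
        # deliberately no print(item) here: a bare print bypasses LOG_LEVEL
        # and would put item data straight back into the job log
        return item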

neverlastn
  • I hoped your suggestion would work, but it doesn't. I tried LOG_LEVEL=WARN and it does not make any difference. How is this setting different from the log level in the settings file? – Pullie Apr 24 '16 at 20:45
  • Well done for using docker, by the way. I don't know exactly how you use the scraped data, but in general you get lower overhead if you e.g. bulk upload your `Item`s after crawling completes, instead of doing it while you crawl in an item pipeline. That is the concept behind those `.jl` files: you hook into the `spider_closed` signal and use a bulk tool to upload the `.jl` file. There are also rare cases where this isn't recommended or possible, such as when you want minimum latency. – neverlastn Apr 25 '16 at 01:22
  • In the pipeline I save directly to a MySQL database running in another Docker container. I override the process_item method: `def process_item(self, productitem, spider):` – Pullie Apr 25 '16 at 05:54
  • Great answer! It explains how to disable both item storing and item logging. Your solution works.