I am using Scrapy to crawl several websites, and need the output to be in JSON. I have set the command:

scrapy crawl MySpider -o "path/to/output.json" -t json

That works; however, now I need to add stats to the output - the list of requests, errors, and types of errors (404s, etc.). Also, I need the output file to be rewritten, not appended to. I can't find any instructions on how to do this.

Ognjen

2 Answers

AFAIK, Item Exporters deal only with items, so it doesn't make sense to have JsonItemExporter export stats to the same file -- the structure of the data is different.

If you want the data to be overwritten -- delete the file before doing the export.
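If you script the crawl from Python, the same "delete before export" step can be done with the standard library; a minimal sketch (the path is just the one from the question, used for illustration):

```python
import os

def remove_if_exists(path):
    """Delete a previous export file, ignoring the case where it doesn't exist."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass

# Run this before starting the crawl so the new export starts fresh.
remove_if_exists("path/to/output.json")
```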

warvariuc
The item output and the spider's stdout/stderr are two separate concepts, and you are better off not mixing them.
Leave the items part as is, to get the items in a separate file, and collect other useful spider output by redirecting it to a log file, like this:

scrapy crawl MySpider -o "path/to/output.json" > out.log 2>&1

Now you will have all the log output in the out.log file, and you can find the final stats there. Note that you don't need to specify the format with -t explicitly if you use the proper file extension. Also, there is currently no way to make the output file be overwritten rather than appended to, so you can just remove the file beforehand, like:

rm output.json ; scrapy crawl MySpider -o "path/to/output.json" > out.log 2>&1
bosnjak
  • Thanks. Is there a way to do it from Python? I need to have all in one script, and not to generate temporary files, if possible, so the output should be somehow redirected to Python then, and I would like to generate output json manually. – Ognjen Apr 28 '15 at 11:16
  • I'm not sure whether I should use this http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script or just make a custom item pipeline? – Ognjen Apr 28 '15 at 11:30
  • Depends on what you want to do. If you want to run the crawl completely from a Python script, you can find the answers in that link. If you want to modify the output of items, look into [feed exporters](http://doc.scrapy.org/en/latest/topics/feed-exports.html), you can find some examples on SO too. – bosnjak Apr 28 '15 at 13:32
  • I would rather use item pipeline, that would allow me to completely control the output, but I'm not sure how to access spider's stats from the pipeline. – Ognjen Apr 28 '15 at 14:00
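To address the last comment: a pipeline can reach the crawler's stats collector through Scrapy's from_crawler hook (crawler.stats). A minimal sketch of such a pipeline - the output filename and the idea of dumping items plus stats into one JSON file are illustrative choices, not Scrapy defaults:

```python
import json

class StatsAwareJsonPipeline:
    """Item pipeline sketch: collects items and, on close, writes them
    together with the crawler's stats into a single JSON file,
    overwriting it on every run ("w" mode)."""

    def __init__(self, stats):
        self.stats = stats
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.stats is Scrapy's StatsCollector for this crawl.
        return cls(crawler.stats)

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        with open("output.json", "w") as f:
            json.dump(
                {"items": self.items, "stats": self.stats.get_stats()},
                f,
                default=str,  # stats may contain datetimes
            )
```

Enable it via ITEM_PIPELINES in settings.py, and drop the -o flag so the pipeline alone controls the output.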