I've managed to write a Scrapy project that scrapes data from a web page, and when I call it at the command line with scrapy crawl dmoz -o items.json -t json, it successfully outputs the scraped data to a JSON file.
I then wrote another script that takes that JSON file, loads it, reorganizes the data (I didn't like the default structure Scrapy was producing), and writes it out as a second JSON file. I then use Django's manage.py loaddata fixture.json command to load the contents of that second file into a Django database.
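That middle script is nothing fancy; it's roughly this, with placeholder model and field names instead of my real ones:

import json

# load Scrapy's raw output, rewrap each item in the model/pk/fields
# structure that loaddata expects, and write out the fixture
with open('items.json') as f:
    items = json.load(f)

fixture = [{'model': 'myapp.item',  # placeholder app/model name
            'pk': i + 1,
            'fields': {'title': item['title'], 'link': item['link']}}
           for i, item in enumerate(items)]

with open('fixture.json', 'w') as f:
    json.dump(fixture, f)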
Now, I'm sensing that I'm going to get laughed out of the building for doing this in three separate steps, but I'm not quite sure how to put it all together into one script. For starters, it does seem really stupid that I can't just have my Scrapy project output the data in exactly the way I want. But where do I put the code that modifies the 'default' way the Feed exports are outputting my data? Would it just go in my pipelines.py file?
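If it helps to make that concrete, here's the sort of thing I'm imagining living in pipelines.py, registered through ITEM_PIPELINES in settings.py (the actual transformation is just a placeholder):

class ReorganizeItemPipeline(object):
    def process_item(self, item, spider):
        # reshape each item here, before the feed export serializes it?
        item['title'] = item['title'].strip()  # placeholder transformation
        return item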
And secondly, I want to call the scraper from inside a Python script that will then also load the resulting JSON fixture into my database. Is it as simple as putting something like this at the top of my script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
# use the project settings so my pipelines and feed exports still apply
crawler = Crawler(get_project_settings())
# stop the reactor when the spider closes; otherwise reactor.run() never returns
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
and then following it with something like:
from django.something.manage import loaddata
loaddata('/path/to/fixture.json')
And finally, is there any specific place this script would have to live relative to both my Django project and the Scrapy project for it to work properly?
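For what it's worth, the closest real hook I've found in the Django docs is django.core.management.call_command, so the glue script I'm picturing is roughly this (the settings module name is made up):

import os
# point Django at my settings; 'myproject.settings' is a placeholder name
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

from django.core.management import call_command

# 1. run the Crawler/reactor block from above (it returns once the spider closes)
# 2. reshape items.json into fixture.json with the script from earlier
# 3. load the fixture in-process, which is what manage.py loaddata does anyway:
call_command('loaddata', '/path/to/fixture.json')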