I've managed to write a Scrapy project that scrapes data from a web page, and when I call it at the command line with scrapy crawl dmoz -o items.json -t json, it successfully outputs the scraped data to a JSON file.
I then wrote another script that takes that JSON file, loads it, reorganizes the data (I didn't like the default structure Scrapy was producing), and writes it out as a second JSON file. I then use Django's manage.py loaddata fixture.json command to load the contents of that second file into a Django database.
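That middle script is nothing fancy; it's roughly this, with placeholder model and field names instead of my real ones:

import json

# load Scrapy's raw output, rewrap each item in the model/pk/fields
# structure that loaddata expects, and write out the fixture
with open('items.json') as f:
    items = json.load(f)

fixture = [{'model': 'myapp.item',  # placeholder app/model name
            'pk': i + 1,
            'fields': {'title': item['title'], 'link': item['link']}}
           for i, item in enumerate(items)]

with open('fixture.json', 'w') as f:
    json.dump(fixture, f)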
Now, I'm sensing that I'm going to get laughed out of the building for doing this in three separate steps, but I'm not quite sure how to put it all together into one script. For starters, it does seem really stupid that I can't just have my Scrapy project output the data in exactly the way I want. But where do I put the code that modifies the 'default' way the Feed exports are outputting my data? Would it just go in my pipelines.py file?
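If it helps to make that concrete, here's the sort of thing I'm imagining living in pipelines.py, registered through ITEM_PIPELINES in settings.py (the actual transformation is just a placeholder):

class ReorganizeItemPipeline(object):
    def process_item(self, item, spider):
        # reshape each item here, before the feed export serializes it?
        item['title'] = item['title'].strip()  # placeholder transformation
        return item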
And secondly, I want to call the scraper from inside a Python script that will then also load the resulting JSON fixture into my database. Is it as simple as putting something like this at the top of my script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
# use the project settings so my pipelines and feed exports still apply
crawler = Crawler(get_project_settings())
# stop the reactor when the spider closes; otherwise reactor.run() never returns
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
and then following it with something like:
from django.something.manage import loaddata
loaddata('/path/to/fixture.json')
And finally, is there any specific place this script would have to live relative to both my Django project and the Scrapy project for it to work properly?
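For what it's worth, the closest real hook I've found in the Django docs is django.core.management.call_command, so the glue script I'm picturing is roughly this (the settings module name is made up):

import os
# point Django at my settings; 'myproject.settings' is a placeholder name
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

from django.core.management import call_command

# 1. run the Crawler/reactor block from above (it returns once the spider closes)
# 2. reshape items.json into fixture.json with the script from earlier
# 3. load the fixture in-process, which is what manage.py loaddata does anyway:
call_command('loaddata', '/path/to/fixture.json')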