I have been searching for a way to automate my Scrapy spider and write its output to an Excel-readable CSV file. So far, the only working approach I have found is the tedious, manual command:
scrapy crawl myscript -o myscript.csv -t csv
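From what I have read in Scrapy's feed-export documentation, those flags can apparently be moved into settings.py, so that a plain scrapy crawl Test writes the CSV on its own, and FEED_EXPORT_FIELDS should control the column order. A sketch of what I have in mind (the output filename is just a placeholder):

# settings.py -- feed export configured once, instead of CLI flags
FEED_FORMAT = 'csv'
FEED_URI = 'myscript.csv'  # placeholder path; any writable location should work
FEED_EXPORT_FIELDS = ['title', 'pubDate', 'link', 'description']  # fixes column order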
I want to be able to format each of these items into a more collected "row" layout, with one field per column. Furthermore, is there any way to make the scraper automated? Ideally, I want the code to run once per day, and I want to be notified whenever there has been an update regarding my scrape, an update meaning a new relevant post.
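For the once-per-day part, my current thinking (untested, and the script name is made up) is to drive the spider from a small standalone script with Scrapy's CrawlerProcess, and then let cron or any other scheduler call that script daily:

# run_yahoo.py -- hypothetical wrapper so a scheduler (e.g. cron) can run the spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # picks up settings.py, incl. the feed export
process.crawl('Test')  # spider name, as defined on the spider class
process.start()        # blocks until the crawl finishes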
My spider is working, and here is the code:
import scrapy
from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem


class Spider(XMLFeedSpider):
    name = "Test"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=GOOGL',)
    itertag = 'item'  # iterate over each <item> node in the RSS feed

    def parse_node(self, response, node):
        item = {}
        item['title'] = node.xpath('title/text()').extract_first()
        # <pubDate> is a sibling of <link> in RSS 2.0, not a child of it
        item['pubDate'] = node.xpath('pubDate/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first()
        return item
I am aware that to further export/organize my output, I have to edit the pipeline settings (at least, that is what the majority of articles I have read suggest).
Below is my pipelines.py code:
class YahooscrapePipeline(object):
    def process_item(self, item, spider):
        return item
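For what it's worth, my best guess from the item-exporter docs is that the pipeline would end up looking roughly like this, with CsvItemExporter doing the writing (the filename is again a placeholder):

# pipelines.py -- rough sketch, assuming scrapy.exporters.CsvItemExporter
from scrapy.exporters import CsvItemExporter

class YahooscrapePipeline(object):
    def open_spider(self, spider):
        self.file = open('yahoo_items.csv', 'wb')  # placeholder output path
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)  # one CSV row per scraped item
        return item

with the pipeline enabled via ITEM_PIPELINES = {'YahooScrape.pipelines.YahooscrapePipeline': 300} in settings.py, if I understand the docs correctly.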
How can I set it up so that I can just execute the spider, and it will automatically write the output file?
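As for notifying myself when a relevant post shows up, the closest thing I have found is Scrapy's built-in scrapy.mail.MailSender; here is a sketch of how I imagine wiring it into a pipeline (the relevance test and the address are made up):

# hypothetical notification pipeline using scrapy.mail.MailSender
from scrapy.mail import MailSender

class NotifyPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        pipeline.mailer = MailSender.from_settings(crawler.settings)  # reads MAIL_* settings
        return pipeline

    def process_item(self, item, spider):
        # made-up relevance test: any title mentioning GOOGL counts as an update
        if 'GOOGL' in (item.get('title') or ''):
            self.mailer.send(
                to=['me@example.com'],  # placeholder address
                subject='New relevant Yahoo post',
                body=item.get('link') or '',
            )
        return item

Is something along those lines the right direction?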
Update: I am using Scrapinghub's API, via the shub command-line tool, to host my spider. It is very convenient and easy to use.
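If I read the shub documentation correctly, a run can apparently also be kicked off from the command line with something like:

shub schedule Test

and Scrapy Cloud seems to support periodic jobs, so the once-per-day requirement might be solvable there instead of with cron.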