
I have been searching for a way to automate my Scrapy spider and write its output to an Excel-readable file (CSV). So far, the only workable approach is the tedious, manual command:

scrapy crawl myscript -o myscript.csv -t csv

I want to be able to format each of these into a more collected "row" format. Furthermore, is there any way I can make the scraper automated? Ideally, I want the code to run once per day, and I want to be notified when there has been an update regarding my scrape, an "update" being a relevant new post.

My spider is working, and here is the code:

import scrapy

from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "Test"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=GOOGL',)
    itertag = 'item'

    def parse_node(self, response, node):
        # node is one <item> element from the RSS feed
        item = {}
        item['title'] = node.xpath('title/text()').extract_first()
        # pubDate is a sibling of link inside each item, not a child of link
        item['pubDate'] = node.xpath('pubDate/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first()
        return item

I am aware that to further customize how my output is exported and organized, I have to edit the pipeline settings (at least according to the majority of articles I have read).

Below is my pipelines.py code:

class YahooscrapePipeline(object):
    def process_item(self, item, spider):
        return item

How can I set it up so I can just execute the spider and have it automatically write the output file?

Update: I am using Scrapinghub's API, which runs off the shub module, to host my spider. It is very convenient and easy to use.

Friezan

2 Answers


Scrapy itself does not handle periodic execution or scheduling; it is completely out of Scrapy's scope. I'm afraid the answer will not be as simple as you want, but it is what's needed.

What you CAN do is use celery beat to allow scheduling based on a crontab schedule. The Stack Overflow question Running Celery tasks periodically (without Django) and http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html should get you started.
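A minimal sketch of what that could look like, assuming a local Redis broker; the module name, task name, and schedule time are illustrative and not from the question:

# tasks.py -- celery beat sketch; broker URL, names, and time are illustrative
from celery import Celery
from celery.schedules import crontab

app = Celery('yahooscrape', broker='redis://localhost:6379/0')

app.conf.beat_schedule = {
    'run-yahoo-spider-daily': {
        'task': 'tasks.run_spider',
        'schedule': crontab(hour=6, minute=0),  # once per day at 06:00
    },
}

@app.task(name='tasks.run_spider')
def run_spider():
    # kick off the crawl here, e.g. via scrapyd's JSON API (see below)
    pass

Start a worker plus the beat scheduler (celery -A tasks worker and celery -A tasks beat) rather than calling the task yourself.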

The other thing I suggest is that you host your spider in scrapyd. That will buy you log retention and a nice JSON API to use when you get more advanced :).
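For instance, once the project is deployed, scheduling a crawl is a single POST to scrapyd's schedule.json endpoint. A sketch assuming a default scrapyd instance on localhost:6800; the project and spider names are taken from the question but depend on how you deploy:

# schedule a crawl through scrapyd's JSON API (assumes scrapyd on localhost:6800)
import requests

response = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'YahooScrape', 'spider': 'Test'},
)
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}

That same call can live inside the Celery task sketched above, which is how the two pieces fit together.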

The Stack Overflow link gives you sample code for running Celery without Django (as a lot of examples assume Django :) ). Remember to run the beat scheduler and not the task directly, as pointed out in the link.

RabidCicada
  • Understood. Is the code stackable in my spider, or how exactly can I implement Celery? It seems a little daunting – Friezan Jun 13 '17 at 20:29
  • This is how I've seen it done before for directly calling your spider: https://stackoverflow.com/questions/22116493/run-a-scrapy-spider-in-a-celery-task – RabidCicada Jun 13 '17 at 21:17
  • The way I would go is to actually use scrapyd and scrapyd-client. Host your spider in scrapyd by running `scrapyd-deploy rabidtest -p rabidhire`, then use scrapyd-client's API to run your spider from the Celery task. There's no need to work around the reactor issue, and it's much better decoupled. – RabidCicada Jun 13 '17 at 21:21

As to your question about organizing the output of your scrape: since you mention you are familiar with exporters, create a custom CSV exporter and register the fields to export in your settings. The order in which they appear in your settings is the order in which they will be written to the CSV file.
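A minimal sketch of that settings-based approach, using the field names from the spider in the question; the output path is illustrative:

# settings.py -- feed export sketch; the output file name is illustrative
FEED_FORMAT = 'csv'
FEED_URI = 'yahoo_headlines.csv'
FEED_EXPORT_FIELDS = ['pubDate', 'title', 'link', 'description']  # CSV column order

With that in place, a plain scrapy crawl Test writes the CSV without the -o and -t flags.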

If I misunderstood this part of the question and you mean vertical rather than horizontal alignment of your items, and you don't have many fields, a quick hack is to add `\n` (newline) or `\t` (tab) characters to the item values in your spider: find the items first, then append the newline or tab. I would give an example, but it is such a hacky thing to do that I'll spare you.

As to scheduling a spider: as mentioned, there is Scrapyd, which I use together with scrapymon. But be warned: as of this moment Scrapyd has some compatibility issues, so do force yourself to create a virtual environment for your scrapyd projects. There is a big learning curve to getting scrapyd working the way you want.

Using Django with Celery is by far the top solution when your scraping gets serious, but it has a much higher learning curve: now you have to deal with server infrastructure, which is even more of a pain when it is not a local server, plus custom integration or alteration of a web-based GUI. If you don't want to mess with all of that, what I did for a long time was use Scrapinghub: get acquainted with their API (you can curl it or use the Python modules they provide) and cron-schedule your spiders as you see fit, right from your PC. The scrape runs remotely, so you keep your local resources free.
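A minimal sketch of kicking off a remote run with the python-scrapinghub client; the API key and project ID are placeholders you would take from your Scrapinghub account:

# run_remote.py -- start a Scrapinghub job; 'APIKEY' and 12345 are placeholders
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('APIKEY')
project = client.get_project(12345)
job = project.jobs.run('Test')  # spider name from the question
print(job.key)

Wrap a call like that in a cron entry and you get the daily run without tying up a local machine.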

scriptso