I have a crawler/spider built with Python's Scrapy, and I want to schedule a daily crawl with it using Amazon Web Services.
What I would like is this: every day at, say, 01:00 UTC, an EC2 instance is created, it launches the Scrapy spider and runs the crawl, and when the crawl is done the instance is terminated.
I don't want the instance left up and running, adding extra cost, because in the future I will add more spiders, and that could leave a dozen idle instances doing nothing 20 hours per day.
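To make this concrete, here is roughly what I picture the launch-and-terminate part looking like with boto3. The AMI ID, region, instance type, project path, and spider name are all placeholders (I would bake an AMI with Scrapy and my project pre-installed); the idea is that the user-data script runs the crawl and then shuts the machine down, and `InstanceInitiatedShutdownBehavior='terminate'` turns that shutdown into a termination:

```python
import boto3

# Placeholders: an AMI I would pre-bake with Scrapy and my project installed.
AMI_ID = "ami-0123456789abcdef0"
SPIDER_NAME = "myspider"

# User data runs once at boot: run the crawl, then shut down.
# Because of InstanceInitiatedShutdownBehavior="terminate" below,
# the shutdown terminates the instance instead of just stopping it.
USER_DATA = f"""#!/bin/bash
cd /home/ec2-user/mycrawler  # placeholder project path
scrapy crawl {SPIDER_NAME}
shutdown -h now
"""

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(
    ImageId=AMI_ID,
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
    InstanceInitiatedShutdownBehavior="terminate",
)
```

What I can't figure out is where this launch code itself should live so that it runs every day at 01:00 UTC without my own computer being involved.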
I found a couple of posts about running Scrapy on EC2:
- http://seminar.io/2013/03/26/running-scrapy-on-amazon-ec2/
- http://bgrva.github.io/blog/2014/04/13/deploy-crawler-to-ec2-with-scrapyd/
- http://www.dataisbeautiful.io/installing-scrapy-and-scrapyd-on-amazon-ec2/
But all of them seem to require you to launch the script from your local computer every time you want to run a crawl; nothing happens automatically. I want this to run 365 days a year, for 10+ years, and I don't want to kick it off every night before I go to bed.
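My best guess so far is that a scheduled CloudWatch Events / EventBridge rule (e.g. `cron(0 1 * * ? *)`) invoking a small Lambda function could replace my laptop as the thing that kicks off the launch. A minimal handler might look like the sketch below (same placeholder IDs as above), but I don't know whether this is the idiomatic AWS way to do it:

```python
import boto3

def lambda_handler(event, context):
    """Launch a one-shot crawl instance that terminates itself when done.

    Intended to be triggered by a schedule rule such as
    cron(0 1 * * ? *), i.e. every day at 01:00 UTC.
    """
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI, as above
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        UserData=(
            "#!/bin/bash\n"
            "cd /home/ec2-user/mycrawler\n"  # placeholder project path
            "scrapy crawl myspider\n"
            "shutdown -h now\n"
        ),
        InstanceInitiatedShutdownBehavior="terminate",
    )
    return {"status": "launched"}
```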
Can someone describe how this is done using Amazon Web Services?