My practice now:
My backend handles a GET request sent by the front-end page and runs my Scrapy spider every time the page is loaded or refreshed. The crawled data is then shown on the page. Here's the code; I call a subprocess to run the spider:
import json
import logging
import os
from subprocess import run

@get('/api/get_presentcode')
def api_get_presentcode():
    # Remove the previous output file, if any
    if os.path.exists("/static/presentcodes.json"):
        run("rm presentcodes.json", shell=True)
    # Run the spider in a subprocess and wait for it to finish
    run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
    with open("/static/presentcodes.json") as data_file:
        data = json.load(data_file)
    logging.info(data)
    return data
It works well.
What I want:
However, the spider crawls a website that rarely changes, so there is no need to crawl it that often.
So I want to run my Scrapy spider every 30 minutes from the backend only, using a coroutine.
What I tried that worked:
import asyncio
from subprocess import run

# init of my web application
async def init(loop):
    ....

async def run_spider():
    while True:
        print("Run spider...")
        await asyncio.sleep(10)  # short interval so the result is easier to see

loop = asyncio.get_event_loop()
tasks = [run_spider(), init(loop)]
loop.run_until_complete(asyncio.wait(tasks))
loop.run_forever()
It works well too.
But when I change the code of run_spider() into this (which is basically the same as the first snippet):
async def run_spider():
    while True:
        if os.path.exists("/static/presentcodes.json"):
            run("rm presentcodes.json", shell=True)
        run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
        await asyncio.sleep(20)
the spider ran only the first time, and the crawled data was stored in presentcodes.json successfully, but the spider was never called again after the 20 seconds had passed.
Questions
What's wrong with my program? Is it because I called a subprocess inside a coroutine, and that is not valid?
Is there a better way to run a spider while the main application is running?
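For comparison, here is a minimal sketch of a non-blocking variant. subprocess.run blocks the whole event loop until the child process exits, whereas asyncio.create_subprocess_shell lets the coroutine await the child instead, so other coroutines can run in the meantime. Whether this explains the behaviour above is not confirmed. The names run_command and crawl_periodically and the max_runs parameter are hypothetical, added only for illustration and so that the loop can be bounded in a test.

```python
import asyncio


async def run_command(cmd, cwd=None):
    """Run a shell command without blocking the event loop; return its exit code."""
    # cwd is forwarded to the underlying process creation, like in subprocess.run
    proc = await asyncio.create_subprocess_shell(cmd, cwd=cwd)
    return await proc.wait()


async def crawl_periodically(cmd, interval, cwd=None, max_runs=None):
    """Run `cmd`, sleep `interval` seconds, and repeat.

    max_runs is a hypothetical knob for testing; None means loop forever.
    Returns the number of completed runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        code = await run_command(cmd, cwd=cwd)
        print("command exited with code", code)
        runs += 1
        await asyncio.sleep(interval)
    return runs
```

In the application, this could be scheduled next to init(loop) by passing the same scrapy command line and cwd="./presentcodeSpider" as in the snippets above, with interval set to 30 * 60. That combination is untested here.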
Edit:
Let me put the code of my web app init function here first:
async def init(loop):
    logging.info("App started at {0}".format(datetime.now()))
    await orm.create_pool(loop=loop, user='root', password='', db='myBlog')
    app = web.Application(loop=loop, middlewares=[
        logger_factory, auth_factory, response_factory
    ])
    init_jinja2(app, filters=dict(datetime=datetime_filter))
    add_routes(app, 'handlers')
    add_static(app)
    srv = await loop.create_server(app.make_handler(), '127.0.0.1', 9000)  # It seems something happened here.
    logging.info('server started at http://127.0.0.1:9000')  # this log didn't show up.
    return srv
My guess is that the main app leaves the coroutine event loop 'stuck', so the spider coroutine is never called back afterwards.
Let me check the source code of create_server and run_until_complete...
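One way to test the 'stuck loop' guess in isolation is a small script with no web framework or spider involved. The sketch below is only a stand-in: init_server replaces the real init(loop), background_job replaces run_spider(), and both names, along with main and max_runs, are hypothetical. If both lists fill up, the event loop can interleave a startup coroutine with a periodic one, which would suggest the problem lies in what the real coroutines do rather than in the scheduling itself.

```python
import asyncio


async def init_server(started):
    """Stand-in for the real init(loop): yields once, then records that it finished."""
    await asyncio.sleep(0)
    started.append("server started")


async def background_job(ticks, interval, max_runs):
    """Stand-in for run_spider(): records a tick, then sleeps, max_runs times."""
    for i in range(max_runs):
        ticks.append(i)
        await asyncio.sleep(interval)


async def main(max_runs=3):
    started, ticks = [], []
    # Schedule the periodic job first, then run the startup coroutine
    task = asyncio.ensure_future(background_job(ticks, 0.01, max_runs))
    await init_server(started)
    await task  # in a real app the loop would keep running instead
    return started, ticks
```

Calling asyncio.run(main()) returns the two lists, so it is easy to see whether both coroutines made progress.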