Run a Scrapy spider periodically while an asyncio-based web application is running

Question

My practice now:

My backend handles the GET request sent by the front-end page and runs my Scrapy spider every time the page is loaded or refreshed. The crawled data is then shown on my front page. Here is the code; I call a subprocess to run the spider:

import json
import logging
import os
from subprocess import run

@get('/api/get_presentcode')
def api_get_presentcode():
    if os.path.exists("/static/presentcodes.json"):
        run("rm presentcodes.json", shell=True)

    run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
    with open("/static/presentcodes.json") as data_file:
        data = json.load(data_file)

    logging.info(data)
    return data

It works well.

What I want:

However, the spider crawls a website that barely changes, so there is no need to crawl it that often.

So I want to run my Scrapy spider every 30 minutes from a coroutine, entirely on the backend.

What I tried that worked:

from subprocess import run

# init of my web application
async def init(loop):
....

async def run_spider():
    while True:
        print("Run spider...")
        await asyncio.sleep(10)  # short interval so the output is easy to observe

loop = asyncio.get_event_loop()
tasks = [run_spider(), init(loop)]
loop.run_until_complete(asyncio.wait(tasks))
loop.run_forever()

It works well too.

But when I change the body of run_spider() to this (which is basically the same as the first snippet):

async def run_spider():
    while True:
        if os.path.exists("/static/presentcodes.json"):
            run("rm presentcodes.json", shell=True)

        run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
        await asyncio.sleep(20)

the spider ran only once. The crawled data was stored in presentcodes.json successfully, but the spider was never run again after the 20-second sleep.

Questions

What's wrong with my program? Is it invalid to call a subprocess inside a coroutine?
Is there a better way to run a spider while the main application is running?

Edit:

Let me put the code of my web app init function here first:

async def init(loop):
    logging.info("App started at {0}".format(datetime.now()))
    await orm.create_pool(loop=loop, user='root', password='', db='myBlog')
    app = web.Application(loop=loop, middlewares=[
        logger_factory, auth_factory, response_factory
    ])
    init_jinja2(app, filters=dict(datetime=datetime_filter))
    add_routes(app, 'handlers')
    add_static(app)
    srv = await loop.create_server(app.make_handler(), '127.0.0.1', 9000)  # It seems something happened here.
    logging.info('server started at http://127.0.0.1:9000') # this log didn't show up.
    return srv

My guess is that the main app leaves the event loop 'stuck', so the spider coroutine is never called back afterwards.

Let me check the source code of create_server and run_until_complete.

This is a Python/async question. Not a scrapy one :) On the principles, don't do `rm` and then `scrapy crawl` because then you will have some time where there will be no file and your requests will be failing. Do the `scrapy crawl` first writing to a temp file e.g. `/static/presentcodes.json.tmp` and then do an `mv /static/presentcodes.json.tmp /static/presentcodes.json` which is atomic-[ish](http://stackoverflow.com/questions/18706419/is-a-move-operation-in-unix-atomic) — neverlastn, Jan 05 '17 at 12:20
@neverlastn You mean which _request_ will be failing? The REST get request from web page? Actually, I'm not using request now, I am just making the _main web application_ and _spider_ run at same time. The file reading and data showing at frontend page will be considered later, now they are both "commented". Anyway, I will try your solution first, thanks! — Spike, Jan 05 '17 at 17:18
Exactly, you're right! Nothing failing now but with `rm`+`crawl` you will have moments of unavailability as soon as you try to do anything with that file (unless you have relatively complex synchronization) — neverlastn, Jan 05 '17 at 19:33

MarSoft · Answer 1 · 2017-01-29T01:02:53.533

Probably not a complete answer, and I would not do it the way you do. But calling the blocking subprocess.run from within an asyncio coroutine is definitely not correct. Coroutines offer cooperative multitasking, so when you make a blocking subprocess call from within a coroutine, that coroutine effectively stops your whole app until the called process has finished.

One thing you need to understand when working with asyncio is that control flow can switch from one coroutine to another only when you call await (or yield from, or async for, async with and other shortcuts). If you perform a long-running operation without any of those, you block all other coroutines until it has finished.
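
The effect can be seen in a small experiment. This is only a sketch, not code from the question: `ticker`, `blocking_pause`, `cooperative_pause`, `count_ticks` and `demo` are hypothetical names, and `time.sleep` stands in for the blocking `subprocess.run` call. A background coroutine records a tick every 50 ms while a second coroutine pauses for half a second, once with a blocking call and once with `await`.

```python
import asyncio
import time


async def ticker(ticks, interval=0.05):
    # Records a timestamp each time the event loop lets this coroutine run.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_pause():
    # time.sleep() never yields to the event loop, so no other coroutine runs.
    time.sleep(0.5)


async def cooperative_pause():
    # await hands control back to the event loop while waiting.
    await asyncio.sleep(0.5)


async def count_ticks(pause):
    ticks = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)  # yield once so the ticker can record its first tick
    await pause()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(ticks)


def demo():
    blocked = asyncio.run(count_ticks(blocking_pause))
    cooperative = asyncio.run(count_ticks(cooperative_pause))
    return blocked, cooperative
```

With the blocking pause the ticker gets to run only once, before the pause starts; with the cooperative pause it keeps ticking for the whole half second. The web server in the question is in the same position as the ticker.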

What you need to use is asyncio.subprocess, which properly returns control to other parts of your application (namely the web server) while the subprocess is running.

Here is how the actual run_spider() coroutine could look:

import asyncio
import os

async def run_spider():
    while True:
        # start the spider without blocking the event loop
        sp = await asyncio.create_subprocess_shell(
            "scrapy crawl presentcodespider -o ../static/presentcodes.new.json",
            cwd="./presentcodeSpider")
        code = await sp.wait()
        if code != 0:
            print("Warning: something went wrong, code %d" % code)
            continue  # retry immediately
        if os.path.exists("/static/presentcodes.new.json"):
            # output was created, overwrite older version (if any)
            os.rename("/static/presentcodes.new.json", "/static/presentcodes.json")
        else:
            print("Warning: output file was not found")

        await asyncio.sleep(20)
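
The same pattern can be tried without Scrapy installed. The following is a sketch under assumptions, not part of the original code: `run_periodically`, `heartbeat` and `main` are hypothetical names, the child process is a stand-in Python one-liner instead of `scrapy crawl`, and the number of runs is bounded so that the demo terminates (the real loop would be `while True`). The heartbeat coroutine plays the role of the web server and shows that the loop stays responsive while the child process runs.

```python
import asyncio
import sys


async def run_periodically(argv, interval, runs):
    """Run argv as a subprocess `runs` times, sleeping `interval` seconds after each run."""
    exit_codes = []
    for _ in range(runs):
        proc = await asyncio.create_subprocess_exec(*argv)
        # wait() suspends this coroutine; other coroutines keep running meanwhile.
        exit_codes.append(await proc.wait())
        await asyncio.sleep(interval)
    return exit_codes


async def heartbeat(beats, interval=0.05):
    # Stand-in for the web server: something that must keep running.
    while True:
        beats.append(1)
        await asyncio.sleep(interval)


async def main():
    beats = []
    hb = asyncio.create_task(heartbeat(beats))
    # Stand-in for the spider: a child Python process that sleeps briefly.
    argv = [sys.executable, "-c", "import time; time.sleep(0.2)"]
    codes = await run_periodically(argv, interval=0.1, runs=2)
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return codes, len(beats)
```

Calling `asyncio.run(main())` returns the list of exit codes together with the number of heartbeats recorded while the child processes were running.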