
I'm new to Python and just wrote some spiders using Scrapy. Now I want to trigger my spider with an HTTP request like this: http://xxxxx.com/myspidername/args

I used nginx + uWSGI + Django to call my Scrapy spider.

Steps:

  1. Create and configure a Django project

  2. Create a Scrapy project in the Django project root and write my spider

  3. Start uWSGI: `uwsgi -x django_socket.xml`

  4. Call my spider in the Django app's views.py

    from django.http import HttpResponse
    from scrapy import cmdline
    def index(request, mid):
        cmd = "scrapy crawl myitem -a mid=" + mid
        cmdline.execute(cmd.split())
        return HttpResponse("Hello, it work")
    

When I visit http://myhost/myapp/index, which points to the index view, nginx returns an error page and its error log shows "upstream prematurely closed connection while reading response header from upstream". I can see the uWSGI process has disappeared, yet in the uWSGI log I can see my spider ran correctly.

How can I fix this error?

Is this approach right? Is there another way to do what I want?

Leon
  • Though I don't have time to investigate this, I'll post my intuitive response. Nginx is, afaik, an asynchronous server. That means you can't block to wait for input. As far as I can see, the `cmdline.execute` call blocks, so that might be inherently impossible in nginx. What you should try is starting a new process from that view. Try experimenting with the `subprocess` module (in the Python standard library). http://stackoverflow.com/questions/3032805/starting-a-separate-process Try opening the Scrapy spider in a new process. This means calling the `scrapy` executable in another shell. – vlad-ardelean Jan 28 '16 at 14:33
  • Thanks! **subprocess** did the job! I used **subprocess.Popen(cmd.split())** and there's no error now. Still a little confused that my request time doesn't seem much different whether I use **subprocess.wait()** or not. @vlad-ardelean – Leon Jan 29 '16 at 08:06
  • Cool, I'll add this as an answer, so other people find it faster. – vlad-ardelean Jan 29 '16 at 08:25

2 Answers


I don't think it's a good idea to launch a spider inside a Django view. A Django web app is meant to provide quick request/response cycles so end users can retrieve information fast. Even though I'm not entirely sure what caused the error, I would imagine your view function gets stuck there until the spider finishes.

There are two options you could try to improve the user experience and minimize the chance of errors:

  1. crontab: It runs your script regularly. It's reliable and easy to log and debug, but it's not flexible for scheduling and you lack fine-grained control.

  2. celery: This is a fairly Python/Django-specific tool that can schedule your tasks dynamically. You can define crontab-like tasks that run regularly, or apply a task at run time. It won't block your view function and executes everything in a separate worker process, so it's most likely what you want. It needs some setup, so it might not be straightforward at first, but many people use it and it works great once everything is in place. A rough sketch is shown after this list.
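
For illustration, here's a minimal sketch of the celery route. It assumes a Celery app is already configured for the Django project (broker, running worker, etc.); the task name `run_spider` and the module layout are made up for the example, while the spider name `myitem` and the `mid` argument come from your question:

    # tasks.py -- sketch only; assumes Celery is already wired into the project
    import subprocess

    from celery import shared_task

    @shared_task
    def run_spider(mid):
        # The Celery worker, not the Django view, waits for the crawl to finish.
        subprocess.call(["scrapy", "crawl", "myitem", "-a", "mid=" + mid])

    # views.py
    from django.http import HttpResponse
    from .tasks import run_spider

    def index(request, mid):
        run_spider.delay(mid)  # enqueue the task; returns immediately
        return HttpResponse("Crawl scheduled")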

Shang Wang

Nginx does asynchronous, non-blocking IO.

The call to `scrapy.cmdline.execute` is synchronous. Most likely this causes problems in the context of nginx.

Try opening a new process upon receiving the request.

There are many (well maybe not THAT many) ways to do this.

Try this question and its answers first: Starting a separate process
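
As a concrete sketch (assuming the Django process runs from a directory where `scrapy crawl` can find your project; the spider name `myitem` and the `mid` argument are taken from your question):

    from django.http import HttpResponse
    import subprocess

    def index(request, mid):
        # Popen launches `scrapy crawl` in a child process and returns immediately,
        # so the view responds without waiting for (or being killed by) the crawl.
        subprocess.Popen(["scrapy", "crawl", "myitem", "-a", "mid=" + mid])
        return HttpResponse("Spider started")

`subprocess.call` (or `Popen(...).wait()`) would instead block the view until the child exits, which is usually not what you want here.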

vlad-ardelean