
I'm new to Python and just wrote some spiders using Scrapy. Now I want to trigger my spider with an HTTP request like this: http://xxxxx.com/myspidername/args

I used nginx + uWSGI + Django to call my Scrapy spider.

Steps:

  1. Create and configure a Django project

  2. Create a Scrapy project in the Django project root and write my spider

  3. Start uWSGI: `uwsgi -x django_socket.xml`

  4. Call my spider in the Django app's views.py

    from django.http import HttpResponse
    from scrapy import cmdline
    def index(request, mid):
        cmd = "scrapy crawl myitem -a mid=" + mid
        cmdline.execute(cmd.split())
        return HttpResponse("Hello, it work")
    

When I visit http://myhost/myapp/index, which points to the index view, nginx returns an error page and its error log shows "upstream prematurely closed connection while reading response header from upstream". I can see the uWSGI process has disappeared, yet in the uWSGI log I can see my spider ran correctly.

How can I fix this error?

Is this approach right? Is there another way to do what I want?

Leon
  • Though I don't have time to investigate this, I'll post my intuitive response. Nginx is, afaik, an asynchronous server. That means you can't block to wait for input. As far as I can see, the `cmdline.execute` call blocks, so that might be inherently impossible in nginx. What you should try is starting a new process from that view. Try experimenting with the `subprocess` module (in the Python standard library). http://stackoverflow.com/questions/3032805/starting-a-separate-process Try opening the Scrapy spider in a new process. This means calling the `scrapy` executable in another shell. – vlad-ardelean Jan 28 '16 at 14:33
  • Thanks! **subprocess** did the job! I used **subprocess.Popen(cmd.split())** and there's no error now. Still a little confused that my request time doesn't seem much different whether I use **subprocess.wait()** or not. @vlad-ardelean – Leon Jan 29 '16 at 08:06
  • Cool, I'll add this as an answer, so other people find it faster. – vlad-ardelean Jan 29 '16 at 08:25

2 Answers


I don't think it's a good idea to launch a spider inside a Django view. A Django web app is meant to provide quick request/response cycles so end users can retrieve information fast. Even though I'm not entirely sure what caused the error, I would imagine your view function gets stuck there until the spider finishes.

There are two options you could try to improve the user experience and minimize the chance of errors:

  1. crontab: It runs your script regularly. It's reliable and easy to log and debug, but it's not flexible for scheduling and you lack fine-grained control.

  2. celery: This is a fairly Python/Django-specific tool that can schedule your tasks dynamically. You can define crontab-like tasks that run regularly, or apply a task at run time. It won't block your view function and executes everything in a separate worker process, so it's most likely what you want. It needs some setup, so it might not be straightforward at first, but many people use it and it works great once everything is in place. A rough sketch is shown after this list.
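
For illustration, here's a minimal sketch of the celery route. It assumes a Celery app is already configured for the Django project (broker, running worker, etc.); the task name `run_spider` and the module layout are made up for the example, while the spider name `myitem` and the `mid` argument come from your question:

    # tasks.py -- sketch only; assumes Celery is already wired into the project
    import subprocess

    from celery import shared_task

    @shared_task
    def run_spider(mid):
        # The Celery worker, not the Django view, waits for the crawl to finish.
        subprocess.call(["scrapy", "crawl", "myitem", "-a", "mid=" + mid])

    # views.py
    from django.http import HttpResponse
    from .tasks import run_spider

    def index(request, mid):
        run_spider.delay(mid)  # enqueue the task; returns immediately
        return HttpResponse("Crawl scheduled")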

Shang Wang

Nginx does asynchronous, non-blocking IO.

The call to `scrapy.cmdline.execute` is synchronous. Most likely this causes problems in the context of nginx.

Try opening a new process upon receiving the request.

There are many (well maybe not THAT many) ways to do this.

Try this question and its answers first: Starting a separate process
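
As a concrete sketch (assuming the Django process runs from a directory where `scrapy crawl` can find your project; the spider name `myitem` and the `mid` argument are taken from your question):

    from django.http import HttpResponse
    import subprocess

    def index(request, mid):
        # Popen launches `scrapy crawl` in a child process and returns immediately,
        # so the view responds without waiting for (or being killed by) the crawl.
        subprocess.Popen(["scrapy", "crawl", "myitem", "-a", "mid=" + mid])
        return HttpResponse("Spider started")

`subprocess.call` (or `Popen(...).wait()`) would instead block the view until the child exits, which is usually not what you want here.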

vlad-ardelean