
This program takes multiple URLs as input and is called via localhost:8888/api/v1/crawler.

It takes more than an hour to run, which is fine, but it blocks the other APIs: while it is running, no other API responds until the current request finishes. I want this handler to run asynchronously. How can I achieve that with the same program?

@tornado.web.asynchronous
    @gen.coroutine
    @use_args(OrgTypeSchema)
    def post(self, args):
        print "Enter In Crawler Match Script POST"
        print "Argsssss........"
        print args
        data = tornado.escape.json_decode(self.request.body)
        print "Data................"
        import json
        print json.dumps(data.get('urls'))
        from urllib import urlopen
        from bs4 import BeautifulSoup
        try:
                urls = json.dumps(data.get('urls'));
                urls  = urls.split()

                import sys

                list = [];

                # orig_stdout = sys.stdout
                # f = open('out.txt', 'w')
                # sys.stdout = f
                for url in urls:
                    # print "FOFOFOFOFFOFO"
                    # print url
                    url = url.replace('"'," ")
                    url = url.replace('[', " ")

                    url = url.replace(']', " ")
                    url = url.replace(',', " ")
                    print "Final Url "
                    print url
                    try:
                        site = urlopen(url) ..............
rohit

2 Answers

Your post method is 100% synchronous. You should make the site = urlopen(url) call asynchronous. Tornado ships an async HTTP client (tornado.httpclient.AsyncHTTPClient) for that. There is also a good example here.

Fine

You are using urllib, which is the reason for the blocking.

Tornado provides a non-blocking client called AsyncHTTPClient, which is what you should be using.

Use it like this:

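from tornado import gen
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    ...
    http_client = AsyncHTTPClient()
    # yield hands control back to the IOLoop until the response arrives,
    # so other requests can be served in the meantime
    site = yield http_client.fetch(url)
    # site is an HTTPResponse; the page content is in site.body
    ...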
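
Since the handler in the question loops over many URLs, it may also help to start all the fetches at once instead of yielding on them one by one. The following is only a minimal sketch of that idea, not the asker's actual code: `fetch_one` and `fetch_all` are hypothetical helper names, and the optional `fetch` parameter is an assumption added so the HTTP call can be swapped out (for example in tests). It relies on the fact that a `gen.coroutine` can yield a list of futures and wait for all of them.

```python
from tornado import gen
from tornado.httpclient import AsyncHTTPClient


@gen.coroutine
def fetch_one(url, fetch=None):
    """Fetch a single URL and return (url, body), or (url, None) on failure.

    `fetch` is a hypothetical hook; by default Tornado's client is used.
    """
    fetch = fetch or AsyncHTTPClient().fetch
    try:
        response = yield fetch(url)
        body = response.body
    except Exception:
        # HTTP errors, timeouts, connection failures: skip this URL
        body = None
    raise gen.Return((url, body))


@gen.coroutine
def fetch_all(urls, fetch=None):
    """Start every fetch first, then wait for all of them together."""
    # Yielding a list of futures lets the requests overlap instead of
    # running one after another.
    results = yield [fetch_one(url, fetch) for url in urls]
    raise gen.Return(dict(results))
```

Inside `post`, this could be used as `pages = yield fetch_all(urls)`, after which `pages[url]` holds the raw body to pass to BeautifulSoup. Note that the parsing itself still runs on the IOLoop thread, so very heavy parsing can block other requests even when the downloads do not.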

Another thing I'd like to point out: don't import modules inside a function. It is not the reason for the blocking, but it is still slower than putting all your imports at the top of the file. Read this question.

xyres