How to implement concurrency in scrapy (Python) middleware

Question

Edit 2

Second approach. For now, I gave up on using multiple instances and configured scrapy settings not to use concurrent requests. It's slow but stable. I opened a bounty. Who can help to make this work concurrently? If I configure scrapy to run concurrently, I get segmentation faults.

class WebkitDownloader( object ):

    def __init__(self):
        os.environ["DISPLAY"] = ":99"
        self.proxyAddress = "a:b@" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)


    def process_response(self, request, response, spider):
        self.request = request
        self.response = response
        if 'cached' not in response.flags:
            webkitBrowser = webkit.WebkitBrowser(proxy = self.proxyAddress, gui=False, timeout=0.5, delay=0.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt'])
            #print "added to queue: " + str(self.counter)
            webkitBrowser.get(html=response.body, num_retries=0)
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
            webkitBrowser.setPage(None)
            del webkitBrowser
        return response

Edit:

I tried to answer my own question in the meantime and implemented a queue but it does not run asynchronously for some reason. Basically when webkitBrowser.get(html=response.body, num_retries=0) is busy, scrapy is blocked until the method is finished. New requests are not assigned to the remaining free instances in self.queue.

Can anyone please point me into right direction to make this work?

class WebkitDownloader( object ):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.queue = list()
        for i in range(8):
            self.queue.append(webkit.WebkitBrowser(proxy = proxyAddress, gui=True, timeout=0.5, delay=5.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt']))

    def process_response(self, request, response, spider):

        i = 0
        for webkitBrowser in self.queue:
            i += 1
            if webkitBrowser.status == "WAITING":
                break
        webkitBrowser = self.queue[i]

        if webkitBrowser.status == "WAITING":
            # load webpage
            print "added to queue: " + str(i)
            webkitBrowser.get(html=response.body, num_retries=0)
            webkitBrowser.scrapyResponse = response

        while webkitBrowser.status == "PROCESSING":
            print "waiting for queue: " + str(i)  

        if webkitBrowser.status == "DONE":
            print "fetched from queue: " + str(i)
            #response = webkitBrowser.scrapyResponse
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            #response = response.replace(**kwargs)
            webkitBrowser.status = "WAITING"
            return response

I am using WebKit in a scrapy middleware to render JavaScript. Currently, scrapy is configured to process 1 request at a time (no concurrency).

I'd like to use concurrency (e.g. 8 requests at a time) but then I need to make sure that 8 instances of WebkitBrowser() receive requests based on their individual processing state (a fresh request as soon as WebkitBrowser.get() is done and ready to receive the next request)

How would I achieve that with Python? This is my current middleware:

class WebkitDownloader( object ):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.w = webkit.WebkitBrowser(proxy = proxyAddress, gui=True, timeout=0.5, delay=0.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt'])

    def process_response(self, request, response, spider):
        if not ".pdf" in response.url:
            # load webpage
            self.w.get(html=response.body, num_retries=0)
            html = self.w.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)

        return response

score 3 · Accepted Answer · edited May 23 '17 at 11:45

I don't follow everything in your question because I don't know scrapy and I don't understand what would cause the segfault, but I think I can address one question: why is scrapy blocked when webkitBrowser.get is busy?

I don't see anything in your "queue" example that would give you the possibility of parallelism. Normally, one would use either the threading or multiprocessing module so that multiple things can run "in parallel". Instead of simply calling webkitBrowser.get, I suspect that you may want to run it in a thread. Retrieving web pages is a case where python threading should work reasonably well. Python can't do two CPU-intensive tasks simultaneously (due to the GIL), but it can wait for responses from web servers in parallel.

Here's a recent SO Q/A with example code that might help.

Here's an idea of how to get you started. Create a Queue. Define a function which takes this queue as an argument, get's the web page and puts the response in the queue. In the main program, enter a while True: loop after spawning all the get threads: check the queue and process the next entry, or time.sleep(.1) if it's empty.

The problem is that if I use multithreading, I need to first create a callback that finally returns the scraped item once it has been processed by webkitBrowser.get() Any suggestions how to do that? I'd be happy to attribute the bounty points to you, if you can get me on the right track. — Jabb, Mar 16 '14 at 10:06
@Jabb Don't worry about the bounty. I don't know enough about scrapy and webkit to give you a full example, but I think I can get you started. Create a http://docs.python.org/2.7/library/queue.html Define a function which takes this queue as an argument, `get`'s the web page and puts the response in the queue. In the main program, enter a `while True:` loop after spawning all the `get` threads: check the queue and process the next entry, or `time.sleep(.1)` if it's empty. I'll add this idea to my answer if this sounds promising to you. — jrennie, Mar 16 '14 at 11:18

score 1 · Answer 2 · edited May 23 '17 at 11:50

I am aware this is an old question, but I had the similar question and hope this information I stumbled upon helps others with this similar question:

If scrapyjs + splash works for you (given you are using a webkit browser, they likely do, as splash is webkit-based), it is probably the easiest solution;
If 1 does not work, you may be able to run multiple spiders at the same time with scrapyd or do multiprocessing with scrapy;
Depending on wether your browser render is primarily waiting (for pages to render), IO intensive or CPU-intensive, you may want to use non-blocking sleep with twisted, multithreading or multiprocessing. For the latter, the value of sticking with scrapy diminishes and you may want to hack a simple scraper (e.g. the web crawler authored by A. Jesse Jiryu Davis and Guido van Rossum: code and document) or create your own.

How to implement concurrency in scrapy (Python) middleware

2 Answers2