I'm experimenting with my first small scraper in Python, and I want to use asyncio to fetch multiple websites simultaneously. I've already written a function that works with aiohttp, but since aiohttp.request() does not execute JavaScript it isn't ideal for scraping some dynamic web pages. That motivates trying Selenium with PhantomJS as a headless browser.

There are a couple of snippets of code demonstrating the use of BaseEventLoop.run_in_executor - such as here - but the documentation is sparse and my copy-and-paste mojo is not strong enough.

If someone would be kind enough to expand on the use of asyncio to wrap blocking calls in general, or explain what's going on in this specific case, I'd appreciate it! Here is what I've knocked together so far:

@asyncio.coroutine
def fetch_page_pjs(self, url):
    '''
    (self, string, int) -> None
    Performs async website content retrieval
    '''
    loop = asyncio.get_event_loop()
    try:
        future = loop.run_in_executor(None, self.driver.get, url)
        print(url)
        response = yield from future
        print(response)
        if response.status == 200:
            body = BeautifulSoup(self.driver.page_source)
            self.results.append((url, body))
        else:
            self.results.append((url, ''))
    except:
        self.results.append((url, ''))

Response returns 'None' - why?

  • I came across this question whilst searching for run_in_executor. I'm not sure if I have the answer, but are you expecting the `fetch_page_pjs` function to return the response? Currently it isn't returning anything, so Python will return `None`. It also appears, from your use of `self`, that this function is part of a class, and you never actually save the response to the class. Is that intended? Lastly, the first answer in the question you linked calls the equivalent of your `fetch_page_pjs` function via `loop.run_until_complete(main())`, i.e. on the event loop. Are you doing this? – neRok Jul 14 '15 at 03:02

1 Answer

This is not an asyncio or run_in_executor issue. The Selenium API simply cannot be used that way. First, driver.get doesn't return anything; see the Selenium docs. Second, it is not possible to get status codes with Selenium directly; see this Stack Overflow question.
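
As for wrapping blocking calls in general, the pattern in the question is already the right one: hand the blocking callable to loop.run_in_executor and yield from the returned future. A minimal sketch of that pattern, written in the same @asyncio.coroutine / yield from style as the post and using time.sleep purely as a stand-in for any blocking call, might look like this:

import asyncio
import time

@asyncio.coroutine
def run_blocking(loop, seconds):
    # Hand the blocking call to the default ThreadPoolExecutor (the None argument)
    # and suspend this coroutine until it finishes.
    future = loop.run_in_executor(None, time.sleep, seconds)
    result = yield from future
    # time.sleep returns None when it completes, just like driver.get does.
    print(result)

loop = asyncio.get_event_loop()
loop.run_until_complete(run_blocking(loop, 1))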

This code worked for me:

@asyncio.coroutine
def fetch_page_pjs(self, url):
    '''
    (self, string) -> None
    Performs async website content retrieval
    '''
    loop = asyncio.get_event_loop()
    try:
        # Run the blocking driver.get call in the default executor.
        future = loop.run_in_executor(None, self.driver.get, url)
        print(url)
        # driver.get returns None, so just wait for the navigation to finish.
        yield from future
        # Parse the rendered page source once the page has loaded.
        body = BeautifulSoup(self.driver.page_source)
        self.results.append((url, body))
    except:
        self.results.append((url, ''))
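
As the comment above suggests, the coroutine still has to be driven by the event loop. A rough sketch of how that might look - the Scraper class name and its setup here are assumptions for illustration only - is:

import asyncio

loop = asyncio.get_event_loop()
scraper = Scraper()  # hypothetical class providing self.driver and self.results = []

urls = ['http://example.com', 'http://example.org']

# Run the fetches one after another on the single shared driver;
# a WebDriver instance is not thread-safe, so truly concurrent fetches
# would need one driver per task.
for url in urls:
    loop.run_until_complete(scraper.fetch_page_pjs(url))

print(scraper.results)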