I'm experimenting with my first small scraper in Python, and I want to use asyncio to fetch multiple websites concurrently. I've already written a function that works with aiohttp, but since aiohttp.request() does not execute JavaScript, it isn't ideal for scraping dynamic web pages. That motivates trying Selenium with PhantomJS as a headless browser.
There are a couple of code snippets demonstrating the use of BaseEventLoop.run_in_executor - such as here - but the documentation is sparse and my copy-and-paste mojo is not strong enough.
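To check that I understand the basic pattern, here is the minimal sketch I distilled from those snippets - blocking_wrapper is just a name I made up, and time.sleep stands in for an arbitrary blocking call:

import asyncio
import time

@asyncio.coroutine
def blocking_wrapper(loop, seconds):
    # Hand the blocking call off to the default thread-pool executor
    # and suspend this coroutine until the worker thread finishes.
    future = loop.run_in_executor(None, time.sleep, seconds)
    result = yield from future
    return result  # whatever the blocking callable returned (None for time.sleep)

loop = asyncio.get_event_loop()
loop.run_until_complete(blocking_wrapper(loop, 1))

As far as I can tell, the future resolves to whatever the blocking callable returns, which is what I tried to rely on below.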
If someone would be kind enough to expand on using asyncio to wrap blocking calls in general, or to explain what's going on in this specific case, I'd appreciate it! Here is what I've knocked together so far:
@asyncio.coroutine
def fetch_page_pjs(self, url):
    '''
    (self, string) -> None
    Performs async website content retrieval
    '''
    loop = asyncio.get_event_loop()
    try:
        # Run the blocking Selenium call in the default thread-pool executor.
        future = loop.run_in_executor(None, self.driver.get, url)
        print(url)
        response = yield from future
        print(response)  # prints None
        if response.status == 200:
            # Parse the DOM as rendered by PhantomJS.
            body = BeautifulSoup(self.driver.page_source)
            self.results.append((url, body))
        else:
            self.results.append((url, ''))
    except:
        self.results.append((url, ''))
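For completeness, this is roughly how I kick everything off - Scraper is a trimmed-down stand-in for my real class, whose __init__ sets self.driver = webdriver.PhantomJS() and self.results = []:

urls = ['http://example.com', 'http://example.org']
scraper = Scraper()
loop = asyncio.get_event_loop()
# Schedule one coroutine per URL and run them concurrently.
loop.run_until_complete(asyncio.wait([scraper.fetch_page_pjs(url) for url in urls]))
print(scraper.results)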
When I run this, print(response) outputs None - why does the future resolve to None?