
I've written a script in Python with pyppeteer and asyncio to scrape the links of different posts from the landing page and then grab the title of each post by following the url to its inner page. The content I'm parsing here isn't dynamic; I used pyppeteer and asyncio anyway to see how efficiently they perform asynchronously.

The following script runs fine for a while but then encounters an error:

File "C:\Users\asyncio\tasks.py", line 526, in ensure_future
raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required

This is what I've written so far:

import asyncio
from pyppeteer import launch

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    tasks = [await browse_all_links(link, page) for link in linkstorage]
    results = await asyncio.gather(*tasks)
    return results

async def browse_all_links(link, page):
    await page.goto(link)
    title = await page.querySelectorEval('.question-hyperlink','(e => e.innerText)')
    print(title)

async def main(url):
    browser = await launch(headless=True,autoClose=False)
    page = await browser.newPage()
    await fetch(page,url)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(main(link))
    loop.run_until_complete(future)
    loop.close()

My question: how can I get rid of that error and do the scraping asynchronously?

1 Answer


The problem is in the following lines:

tasks = [await browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)

The intention is for tasks to be a list of awaitable objects, such as coroutine objects or futures. The list is to be passed to gather, so that the awaitables can run in parallel until they all complete. However, the list comprehension contains an await, which means that it:

  • executes each browse_all_links call to completion in series rather than in parallel;
  • places the return values of browse_all_links invocations into the list.

Since browse_all_links doesn't return a value, you are passing a list of None objects to asyncio.gather, which complains that it didn't get an awaitable object.
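
To see the difference in a standalone example (the browse coroutine here is just a made-up stand-in for browse_all_links):

import asyncio

async def browse(link):
    # Pretend to visit the link; like browse_all_links, it returns None.
    await asyncio.sleep(0.1)
    print(f"visited {link}")

async def main():
    links = ["a", "b", "c"]

    # await inside the comprehension: each coroutine runs to completion,
    # one after another, and the list ends up holding their return values.
    serial = [await browse(link) for link in links]
    print(serial)  # [None, None, None] -- nothing gather can await

    # no await: the list holds coroutine objects, which gather runs concurrently.
    tasks = [browse(link) for link in links]
    await asyncio.gather(*tasks)

asyncio.run(main())  # Python 3.7+; older versions can use loop.run_until_complete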

To resolve the issue, just drop the await from the list comprehension.
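
With the await removed, those two lines become:

tasks = [browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)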

user4815162342
  • I complied with your suggestion and ran the code @user4815162342. This time it gives me another error. The thing is it collects a single title 15 times whereas it is supposed to produce 15 different titles. [Full traceback can be found here](https://pastebin.com/mmVx0euN). Thanks. – robots.txt Dec 14 '18 at 06:28
  • @robots.txt Maybe the problem is that you're also sending the same `page` to all coroutines, so they're stepping on each other's toes? You might need to create the `page` in `browse_all_links` instead of in `main`. – user4815162342 Dec 14 '18 at 07:40
  • This time I've tried to modify my script exactly as you suggested, but I'm still getting a single link 15 times, as I mentioned earlier. ***[Check out my current script here](https://pastebin.com/28e3L44V)***. Thanks a lot @user4815162342. – robots.txt Dec 14 '18 at 08:31
  • @robots.txt You're still sending the same `page` object to all `browse_all_links`. You need to move page creation, i.e. `page = await browser.newPage()`, to `browse_all_links`. – user4815162342 Dec 14 '18 at 09:08
  • Now, it appears to be working correctly @user4815162342. Please ***[check out this link](https://pastebin.com/t7hKGqgv)*** to see what I've done finally. I'm still not sure I followed your instructions properly. From the very beginning I was trying not to create a new browser for each new url, as that puts a huge load on the computer. However, is this what you meant? Thanks a zillion for staying with me so far. – robots.txt Dec 14 '18 at 09:55
  • @robots.txt That's what I meant. And, for what it's worth, you're still not creating a new browser per URL, you're creating a new _page_ per URL (I'm not sure if this helps with resource usage since I don't use `pyppeteer`.) Since the page provides the stateful `goto` API, you obviously cannot *both* share the same page among the `browse_all_links` coroutines *and* have them execute in parallel. – user4815162342 Dec 14 '18 at 10:03
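
For completeness, a rough sketch of the restructuring discussed in the comments above: the browser object gets passed around, and each browse_all_links call opens (and closes) its own page. The actual final script is behind the pastebin links, so details such as closing the pages and the browser are assumptions here, not the poster's exact code:

import asyncio
from pyppeteer import launch

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(browser, url):
    # One page is enough to collect the post links from the landing page.
    page = await browser.newPage()
    await page.goto(url)
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    linkstorage = [await page.evaluate('(element) => element.href', element)
                   for element in elements]
    await page.close()
    # No await in the comprehension: gather runs the coroutines concurrently.
    tasks = [browse_all_links(browser, link) for link in linkstorage]
    return await asyncio.gather(*tasks)

async def browse_all_links(browser, link):
    # Each coroutine gets its own page, so they no longer share state.
    page = await browser.newPage()
    await page.goto(link)
    title = await page.querySelectorEval('.question-hyperlink', '(e => e.innerText)')
    print(title)
    await page.close()
    return title

async def main(url):
    browser = await launch(headless=True)
    try:
        return await fetch(browser, url)
    finally:
        await browser.close()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(link))
    loop.close()

Only one browser is launched; each url gets its own page inside it, which is the trade-off the last comment settles on.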