1

I've studied sync and async in JavaScript. I'm going to make a crawling program using Puppeteer.

There are many code examples of crawling in Puppeteer.

But, I have one question: Why do they use async in basic Puppeteer example scripts?

Can't I use sync programming in Puppeteer? Is there an issue that I don't know about that makes async necessary?

It doesn't seem useful if I don't use multiple threads (multi-crawling).

ggorlen
  • Puppeteer has to communicate with the browser through a network connection, and because the network can be slow, this is made asynchronous. You can't make asynchronous stuff synchronous, but if you're doing things sequentially, you can just wrap the entire code in an `async` function, put `await` in front of Puppeteer calls that require it, and you're done. – FZs Mar 06 '22 at 07:11
  • @FZs If the network slows down, shouldn't we have to wait for it anyway? That's why we use synchronous. – Good Tester Mar 06 '22 at 07:21
  • @FZs Do you mean that if we use synchronous code in Puppeteer, we have to handle more exceptions? I couldn't understand "the network can be slow, this is made asynchronous." – Good Tester Mar 06 '22 at 07:24
  • You do want to await everything asynchronous, but if it was synchronous and your code wanted to do some other stuff while waiting for Puppeteer, it couldn't do it because Puppeteer would block the event loop. And even though your code doesn't want to do anything while waiting, it can use it the same way. – FZs Mar 06 '22 at 08:59

2 Answers

2

For starters, I recommend reading How the single threaded non blocking IO model works in Node.js. This thread motivates the callback and promise-based models Node provides for achieving concurrency.

Whenever the Node process needs to access an out-of-process resource such as the file system or a network socket (as Puppeteer does to communicate with the browser it's connected to), there are two options:

  1. Block the whole process and wait for the response, as fs.readFileSync does.
  2. Use a promise or a callback to be notified of the response and go about other things, as fs.readFile (either via callback or fs.promises) and Puppeteer do.

The first option is a poor choice, with the only advantage being easier syntax to write. Blocking the thread to wait for a resource is like ordering a pizza, then doing nothing until the pizza arrives. You might as well read a book or water your plants while you wait.

Historically, callbacks were the only way to write concurrent code in Node. Eventually, promises and .then() arrived, which were better, but still posed readability burdens. With the advent of async/await, it's no longer difficult to write asynchronous code that reads like synchronous code. Synchronous APIs like fs's __Sync functions that alias an asynchronous API are historical artifacts. It's normal that Puppeteer doesn't offer page.waitForSelectorSync, page.$evalSync, etc.

Now, it's understandable to think that Puppeteer's asynchronous API is pointless in a simple, straight-line script since your Node process doesn't have anything else to do while awaiting responses, but having to type await for each call is the least evil of the available design options for the API.

Simply not awaiting promises isn't an option even when a script is a single sequence of straight-line code. Without await, ordering of operations/results becomes nondeterministic as each promise runs concurrently, independent of the others. This interleaving would be unintended in sequential code, but is a useful tool in cases when concurrency is desired.
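
To make the ordering problem concrete, here is a small sketch that doesn't use Puppeteer at all. `fakeCall` is a hypothetical stand-in for a Puppeteer method, and the delays are made-up latencies. They are fixed here so the result is repeatable, whereas real latencies vary from run to run.

```javascript
// Hypothetical stand-in for a Puppeteer call: resolves after `ms` milliseconds
// and records its name in `log` when it completes.
function fakeCall(name, ms, log) {
  return new Promise((resolve) =>
    setTimeout(() => {
      log.push(name);
      resolve(name);
    }, ms)
  );
}

// Without await: all three calls start at once and finish in order of latency.
async function withoutAwait() {
  const log = [];
  const a = fakeCall("goto", 60, log);
  const b = fakeCall("waitForSelector", 30, log);
  const c = fakeCall("click", 5, log);
  await Promise.all([a, b, c]); // only so the log can be inspected afterwards
  return log;
}

// With await: each call starts only after the previous one has finished.
async function withAwait() {
  const log = [];
  await fakeCall("goto", 60, log);
  await fakeCall("waitForSelector", 30, log);
  await fakeCall("click", 5, log);
  return log;
}

const results = Promise.all([withoutAwait(), withAwait()]);
results.then(([unordered, ordered]) => {
  console.log(unordered); // [ 'click', 'waitForSelector', 'goto' ]
  console.log(ordered);   // [ 'goto', 'waitForSelector', 'click' ]
});
```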

For the authors of an asynchronous API where almost all calls are accessing an external resource, as is the case with Puppeteer, the options are:

  1. Write and maintain two versions of the API, a synchronous and an asynchronous version. No libraries that I know of do this -- it's a major pain with little benefit and plenty of room for misuse.
  2. Write and maintain a synchronous API only to cater to the simple use case at the expense of making the library virtually unusable for anyone that cares about concurrency. Clearly, this is horrible design, like forcing everyone who orders a pizza (in the above real-world example) to do nothing until it arrives.
  3. Write and maintain one asynchronous API, and make clients who don't care about concurrency in a particular program have to write await in front of all the calls. That's what Puppeteer does.
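
One way to see why the third option serves both audiences: with a single asynchronous API, the caller decides whether to run calls one at a time or concurrently. The sketch below assumes a hypothetical `fakeScrape` function and placeholder URLs. It simulates latency with a timer rather than using a real browser.

```javascript
// Hypothetical stand-in for scraping one page; takes roughly 50 ms.
function fakeScrape(url) {
  return new Promise((resolve) =>
    setTimeout(() => resolve(`data from ${url}`), 50)
  );
}

const urls = ["a.example", "b.example", "c.example"]; // placeholder URLs

// Caller who doesn't care about concurrency: await each call in turn.
async function sequential() {
  const start = Date.now();
  const data = [];
  for (const url of urls) {
    data.push(await fakeScrape(url));
  }
  return { data, ms: Date.now() - start };
}

// Caller who does care: start every call, then wait for all of them together.
async function concurrent() {
  const start = Date.now();
  const data = await Promise.all(urls.map((url) => fakeScrape(url)));
  return { data, ms: Date.now() - start };
}

// Run one after the other so the two timings don't overlap.
const timings = sequential().then(async (seq) => ({
  seq,
  con: await concurrent(),
}));

timings.then(({ seq, con }) =>
  console.log(`sequential: ${seq.ms} ms, concurrent: ${con.ms} ms`)
);
```

Both versions return the same data in the same order. The concurrent one should take about as long as a single call instead of the sum of all three. A synchronous-only API would rule out the second version entirely.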

Incidentally, the fact that the browser is in a separate process tends to cause all manner of confusion in Puppeteer beginners. For example, the fact that data is serialized and deserialized (converted to a string) on every call to page.evaluate (and family) means that you can't pass complex structures like DOM nodes across the inter-process gap. You can't access variables you've defined in Node from the body of an evaluate callback without passing them as arguments to the evaluate call, and these variables need to be able to respond correctly to JSON.stringify() (that is, be serializable).
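
As a rough approximation of what crossing the gap does to a value, the sketch below round-trips a few values through JSON. The `crossGap` helper and the `Book` class are hypothetical, and Puppeteer's real serialization differs in its details, so treat this only as an illustration of the general idea.

```javascript
// Approximates the Node -> browser hop with a JSON round trip.
// This is an assumption for illustration, not Puppeteer's actual mechanism.
function crossGap(value) {
  return JSON.parse(JSON.stringify(value));
}

class Book {
  constructor(title) {
    this.title = title;
  }
  describe() {
    return `Book: ${this.title}`;
  }
}

// Plain data survives intact.
const plain = crossGap({ selector: ".book", limit: 3 });

// A class instance arrives as a plain object: its data is kept, its methods are not.
const book = crossGap(new Book("Dune"));

// Function-valued properties are dropped entirely.
const withFn = crossGap({ name: "x", format: (s) => s.trim() });

console.log(plain);                // { selector: '.book', limit: 3 }
console.log(book.title);           // Dune
console.log(book instanceof Book); // false
console.log(typeof book.describe); // undefined
console.log("format" in withFn);   // false
```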

Just 13 hours before this post, someone asked node.js puppeteer "document is not defined" -- they were trying to access the browser process' document object inside of Node.

If you're on Windows, try running a simple Puppeteer Node script that doesn't close the browser, then look at your task manager. On Linux, you can run ps -a. You'll see that there's a Chromium browser process and a Node process. The two processes communicate over a socket, which has much higher latency than intra-process communication and involves the operating system's network stack. Every Puppeteer call provides an opportunity for concurrency that'd be lost if Puppeteer's API were synchronous.

Understanding the inter-process gap is critical to success in Puppeteer because it motivates why the API calls are asynchronous, and helps clarify which code is executing in which process.

ggorlen
0

async is very important for data fetching/crawling. Imagine this case: the page has a book-container element, but the book data inside it only shows up in the UI later, once an API fetch has completed.

const scraperObject = {
    url: 'http://book-store.com',
    scraper(browser){
        // No async/await: nothing below waits for the previous call to finish
        let page = browser.newPage();
        page.goto(this.url);
        page.waitForSelector('.book-container');
        page.waitForSelector('.book');
        //TODO: save book data after this
    }
}

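With this code snippet, every call is fired without waiting for the previous one. (Strictly speaking, browser.newPage() returns a promise, so page.goto would throw a TypeError here; the steps below describe what happens when calls on a page are not awaited.)

  1. page.goto(this.url) Starts navigating to the URL, but doesn't wait for the page to load
  2. page.waitForSelector('.book-container') No await here, so the script moves on immediately without waiting for .book-container (it likely won't be there yet because the page is still loading)
  3. page.waitForSelector('.book') Similarly, the script moves on immediately without waiting for the book data (even though book-container isn't in the HTML yet)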

To solve this problem, we need async/await to wait until the elements are present in the HTML.

const scraperObject = {
    url: 'http://book-store.com',
    async scraper(browser){
        let page = await browser.newPage();
        await page.goto(this.url);
        await page.waitForSelector('.book-container');
        await page.waitForSelector('.book');
        //TODO: save book data after this
    }
}

With async/await, the same steps run in order:

  1. page.goto(this.url) Go to the URL and wait until the page has loaded
  2. page.waitForSelector('.book-container') Wait until the .book-container element appears in the HTML
  3. page.waitForSelector('.book') Wait until a .book element appears in the HTML (which tells us the API data has arrived)
Nick Vu
  • If I don't use `async`, then... `page.goto(this.url);` (do I have to handle a lot of exceptions here?) `page.waitForSelector('.book-container');` – Good Tester Mar 06 '22 at 07:35
  • yeah it is, but it's good to have `async/await` in Puppeteer's functions to make sure your logic is working properly under bad network connections – Nick Vu Mar 06 '22 at 07:47
  • This isn't correct. The reason Puppeteer is async is the same basic reason as why you use async on `fs`, `fetch` and `child_process` calls. It has nothing to do with "bad" network connections, it has to do with network connections, full stop. Since the browser is running in another process, Node has to go through a socket in the OS to talk to it via the network stack, so there's some time on each Puppeteer call where the Node thread is idle and could be doing other meaningful work instead of blocking. – ggorlen Mar 06 '22 at 15:12
  • @ggorlen it's true once you mentioned the basic knowledge for most `Promise` calls, but he's asking why we need to have `async` (of course it goes along with `await`) in Puppeteer. If we don't have it, what will happen? We know that `async` can help to run in another process for not blocking the main thread, but without `async/await` for all these calls, do you think these testing steps will pass smoothly? – Nick Vu Mar 06 '22 at 15:32
  • I see what you mean, but I don't think that's what OP is asking. I think OP wants to know why the Puppeteer API is almost totally asynchronous in the first place, not why you need to `await` promises to keep execution order correct within an asynchronous API. – ggorlen Mar 06 '22 at 15:44
  • @ggorlen `Can't I use sync programming in Puppeteer? Is there an issue that I don't know about that makes async necessary?` I covered most of this part of his question tho. If he is not satisfied, I'd like to help him to understand more. – Nick Vu Mar 06 '22 at 15:50
  • Yeah, I see where you're coming from, but I suggest removing the stuff about bad network connections, which is misleading. I'll remove the downvote if edited -- it's not incorrect as I originally supposed, just a different interpretation of OP's intent than I'd read. I also edited OP's post for clarity and might have subtly changed the meaning to better match my reading of it, not realizing "why do I need to `await` promises?" might have been the original intent. – ggorlen Mar 06 '22 at 16:34
  • Thanks for your suggestion, @ggorlen! I just struck through that part in my answer to make it less confusing. – Nick Vu Mar 06 '22 at 16:40
  • Thanks, but I'd just remove it entirely. Without reading the comments (which are intended to be temporary), readers would be confused about why it was struck out. The promise interleaving can occur regardless of network problems if not awaited. – ggorlen Mar 06 '22 at 17:42