I have a fairly simple class that uses Puppeteer, but no matter what I do, the async code doesn't seem to execute when I put a breakpoint on it.

The `let data = await page.$$eval` line executes, and then nothing happens after that. The debugger never steps into the inner function block.

Surely the await on that line should force the inner async block to execute before it moves onto the console log at the bottom?


let url = "https://www.ikea.com/gb/en/p/godmorgon-high-cabinet-brown-stained-ash-effect-40457851/";
let scraper = new Scraper();
scraper.launch(url);

export class Scraper{
    constructor(){}

    async launch(url: string){
        let browser = await puppeteer.launch({});
        let page = await browser.newPage();
        await page.goto(url);

        let data = await page.$$eval(' body *', elements => {
            console.log("Elements: ", elements);
            elements.forEach(element => {
                console.log("Element: ", element.className);
            })
            return "done";
        })

        console.log("Data: ", data);
    }
}

I'm trying to follow this tutorial.

I even copied this block of code directly from the tutorial but still it doesn't work.

await page.goto(this.url);
// Wait for the required DOM to be rendered
await page.waitForSelector('.page_inner');
// Get the link to all the required books
let urls = await page.$$eval('section ol > li', links => {
    // Make sure the book to be scraped is in stock
    links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
    // Extract the links from the data
    links = links.map(el => el.querySelector('h3 > a').href)
    return links;
});
console.log(urls);
jm123456
  • "*The code doesn't even step into the inner function block.*" because the inner function block is *not* executed in the current environment. Puppeteer serialises it and then runs it in a separate instance. – VLAZ Mar 05 '21 at 12:29
  • @VLAZ Well how am I able to debug my own code then? – jm123456 Mar 05 '21 at 12:29
  • I guess you can open the page and run it in the browser console, then give it to Puppeteer. I'm not sure if there is any better tool. But the way Puppeteer works, it *cannot* "run" this code. Because it opens the page in a completely different page and gives you outside control of it. Executing code on the page has to be done in a different environment. The way Puppeteer seems to do it is just taking the source code and recreating the function for the page. This, among other things, means it destroys references captured by closures. – VLAZ Mar 05 '21 at 12:33
  • @VLAZ Okay I understand, that makes a lot more sense. How can I return a list of all the elements on a page back from this `page.$$eval` function then? – jm123456 Mar 05 '21 at 12:34
  • I don't think you can get the *elements* themselves. You can certainly pass around simple values like primitives and plain objects/arrays. But complex objects will still be mangled by the serialisation/deserialisation between the environments. The tutorial shows how you can get `href`s - you can do the same but with `className`s or `value`s, etc. `return Array.from(elements).map(x => x.className)` for example. – VLAZ Mar 05 '21 at 12:37
  • So the idea is that I need to evaluate the page to get back a collection of elements, and then I need to parse the information I need from the elements and return that list of reduced information? – jm123456 Mar 05 '21 at 13:16
  • `await page.$$eval('section ol > li', links => {` You are awaiting a function _and_ also passing this function a callback. This makes no sense. Either your function returns a Promise, and therefore can be awaited, OR it takes a callback function, not both. – Jeremy Thille Mar 05 '21 at 13:18
  • @JeremyThille I've explained it above. Puppeteer *cannot* run the callback in the same context as the currently running code. `$$eval` itself returns a promise containing the result of running the callback but it does this by serialising/deserialising the function and the result in order to pass them back and forth. The tutorial linked in the question also shows exactly this usage `let urls = await page.$$eval('section ol > li', links => { /* ...*/ })`. – VLAZ Mar 05 '21 at 13:32
  • @jm123456 correct. Processing should happen in the callback. The result you pass back should be very simple, e.g., an array of strings, an object with values that are numbers, etc. Basically anything that will not lose information if you do `JSON.parse(JSON.stringify(result))`. – VLAZ Mar 05 '21 at 13:34
  • @VLAZ - or a bit more complex structures since there is a possibility to pass around `ElementHandle` instances. @jm123456 - you can launch the browser in non-headless mode and simply open devtools while the page is open - you will see whether the evaluation happens or not. You can also intercept page console logs by listening on a `console` event from the page – Oleg Valter is with Ukraine Mar 05 '21 at 13:37
  • @OlegValter ah, I wasn't aware of that. I hadn't seen good documentation on how serialisation/deserialisation happens so I was trying to keep it safe to limit to stuff that can pass through JSON. Makes sense that Puppeteer probably doesn't just do JSON but I wasn't sure where to find the information for that. – VLAZ Mar 05 '21 at 13:39
  • @VLAZ - well, afaik these handles are basically JSON objects describing nodes in the page's DOM, so your point stands true - only serializable values can be passed through between the browser context and the node context. jm123456 - if it helps, your code looks just fine to me, so something probably went wrong on the page itself. In the case of tutorial code - make sure the DOM structure is the same as at the time the tutorial was written and selectors match what they are supposed to match – Oleg Valter is with Ukraine Mar 05 '21 at 13:46
  • As others mentioned, you're not seeing the logs because they're executed in the browser, not in the Node environment Puppeteer runs in. [Add a handler to log them](https://stackoverflow.com/a/60075804/6243352) and you'll see the code executes. The caveat is that some of the objects you're logging will cause a `Protocol error (Runtime.callFunctionOn): Object reference chain is too long`, so take it easy. – ggorlen Mar 05 '21 at 15:00
  • Does this answer your question? [How do print the console output of the page in puppeter as it would appear in the browser?](https://stackoverflow.com/questions/58089425/how-do-print-the-console-output-of-the-page-in-puppeter-as-it-would-appear-in-th) – ggorlen Mar 05 '21 at 15:05
  • @jm123456 - well, I can't offer you anything better than Puppeteer's API [docs](https://github.com/puppeteer/puppeteer/blob/v5.5.0/docs/api.md), really. `ElementHandle`s can be used to streamline processing of nodes where it is possible (Web APIs, for example, still have to be in browser context for obvious reasons). For example, you could use a `$` method, get a handle and call [`getProperties`/`getProperty`](https://github.com/puppeteer/puppeteer/blob/v5.5.0/docs/api.md#elementhandlegetproperties) in Node to get `href` without leaving the context – Oleg Valter is with Ukraine Mar 05 '21 at 15:45
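
To make the points from the comments above concrete, here is a minimal sketch. It does not use Puppeteer and is not Puppeteer's actual implementation; it only imitates the behaviour the comments describe, where the callback is turned into source text, recreated somewhere else, and its result is passed back as plain data. The names `HasClassName`, `collectClassNames`, `simulateEval` and `closureIsLost` are hypothetical and exist only for this illustration.

```typescript
// Minimal structural type so the callback can be exercised without a real DOM.
interface HasClassName {
  className: string;
}

// A callback of the kind you would hand to $$eval: it does all its processing
// itself and returns only simple values (an array of strings).
const collectClassNames = (elements: HasClassName[]): string[] =>
  elements.map((el) => el.className).filter((name) => name.length > 0);

// Rough imitation (an assumption for illustration, NOT Puppeteer internals) of
// running a callback in another environment:
//   1. take the function's source text,
//   2. recreate the function from that text, outside the original scope,
//   3. pass the result back through a JSON round trip.
function simulateEval<T, R>(fn: (arg: T) => R, arg: T): R {
  const source = fn.toString();
  const recreated = new Function(`return (${source});`)() as (arg: T) => R;
  return JSON.parse(JSON.stringify(recreated(arg))) as R;
}

// A callback that relies on a variable from the enclosing scope stops working
// once it is recreated from source, because that variable is not carried along.
function closureIsLost(): boolean {
  const localPrefix = "cls-";
  const usesClosure = (els: HasClassName[]): string[] =>
    els.map((el) => localPrefix + el.className);
  try {
    simulateEval(usesClosure, [{ className: "a" }]);
    return false;
  } catch (_err) {
    return true; // ReferenceError: localPrefix is not defined
  }
}

const demo = simulateEval(collectClassNames, [
  { className: "header" },
  { className: "" },
  { className: "nav main" },
]);
console.log(demo); // [ 'header', 'nav main' ]
console.log(closureIsLost()); // true
```

In the real script, a callback of the same shape would go to `page.$$eval('body *', ...)`, and a breakpoint inside it would still not be hit in Node because it runs in the browser. To see its `console.log` output in the Node terminal, a listener such as `page.on('console', msg => console.log(msg.text()))` can be registered before calling `$$eval`.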

0 Answers