20

I am trying to get a value from inside page.evaluate() body in my YouTube scraper that I've built using Puppeteer. I am unable to return the result from page.evaluate(). How do I achieve this? Here's the code:

let boxes2 = []
        const getData = async() => {
            return await page.evaluate(async () => { // scroll till there's no more room to scroll or you get at least 50 boxes
                console.log(await new Promise(resolve => {

                    var scrolledHeight = 0  
                    var distance = 100 
                    var timer = setInterval(() => {
                        boxes = document.querySelectorAll("div.style-scope.ytd-item-section-renderer#contents > ytd-video-renderer > div.style-scope.ytd-video-renderer#dismissable")
                        console.log(`${boxes.length} boxes`)
                        var scrollHeight = document.documentElement.scrollHeight
                        window.scrollBy(0, distance)
                        scrolledHeight += distance
                        if(scrolledHeight >= scrollHeight || boxes.length >= 50){
                            clearInterval(timer)
                            resolve(Array.from(boxes))
                        }
                    }, 500)
                }))
            })
        }
        boxes2 = await getData()
        console.log(boxes2)

The console.log wrapping the promise prints the resulting array in the browser's console. I just cannot get that array in boxes2 down where I'm calling the getData() function. I feel like I'm missing out on a tiny little bit, but can't figure out what it is. Appreciate any tip here.

Rohit

3 Answers

26

The little issue is that you don't actually return the data from inside of page.evaluate:

const getData = () => {
    return page.evaluate(async () => { 
        return await new Promise(resolve => { // <-- return the data to node.js from browser
            // scraping
        })
    })
}

And here's a full minimal working example for puppeteer that will print array [ 1, 2, 3 ]:

const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();

  let boxes2 = [];

  const getData = async () => {
    return await page.evaluate(async () => {
      return await new Promise(resolve => {
        setTimeout(() => {
          resolve([1, 2, 3]);
        }, 3000);
      });
    });
  };

  boxes2 = await getData();
  console.log(boxes2)

  await browser.close();
});
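
The same behavior can be reproduced in plain Node.js without Puppeteer, which may make the missing return easier to see. This is only a sketch: fakeScrape, logsOnly and returnsValue are hypothetical names, and the immediately resolved promise stands in for the scrolling timer from the question.

```javascript
// Hypothetical stand-in for the scraping promise from the question.
const fakeScrape = () => new Promise(resolve => resolve([1, 2, 3]));

// Mirrors the question: the awaited value is logged, but nothing is returned,
// so the promise produced by this async function resolves to undefined.
const logsOnly = async () => {
  console.log(await fakeScrape());
};

// Mirrors the fix: the value is returned to the caller.
const returnsValue = async () => {
  return await fakeScrape();
};

logsOnly().then(value => console.log('logsOnly resolved to', value));
returnsValue().then(value => console.log('returnsValue resolved to', value));
```

The callback passed to page.evaluate follows the same rule: whatever it returns (or resolves to) is what page.evaluate hands back to Node.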
Vaviloff
  • I did that, but `boxes2` turns out to be `undefined` when I print it – Rohit Aug 20 '19 at 09:59
  • that's why I tried to `console.log` it just to verify if the promise returns the array – Rohit Aug 20 '19 at 10:03
  • Plus, if I put a `console.log` after the `resolve` statement, it gets executed. Does `resolve` not act like a `return` and get you out of the `Promise` block? – Rohit Aug 20 '19 at 10:13
  • Added a working example. If results are present in console.log at line 4, all you need to do is to return them **from** `page.evaluate`, otherwise nothing is returned from `getData` – Vaviloff Aug 20 '19 at 14:11
  • And no, `resolve` does not interrupt execution flow inside of a promise. To do that you could use `return resolve()`, see [this answer](https://stackoverflow.com/questions/34668878/should-i-use-return-in-promise) – Vaviloff Aug 20 '19 at 14:15
  • Aha! `resolve([1,2,3])` gave me the array `[1,2,3]` while `resolve(boxes)` was giving me `undefined`. I don't think `Array.from(boxes)` was converting the node list into an array, but why?? Also, wrapping the query selector statement inside an `Array.from()` gave me undefined. Probably because I was grabbing each component. I took it a step further and went on to grab just the URLs of each video and put an `Array.from()` around it. Works now. – Rohit Aug 21 '19 at 06:14
  • The main issue with your code was that you didn't return data from page evaluate. Since my answer did show how to do it, could you mark it as the solution? – Vaviloff Aug 21 '19 at 06:34
  • done! :) thanks for also pointing me to the other answer on Promises. Peace! – Rohit Aug 21 '19 at 06:47
  • but I still didn't get how `boxes` was `undefined` when resolving `[1,2,3]` gave me the exact array, `[1,2,3]`. Check my answer. The commented line is where I grab all the components which gives a node list and in the line above, I get the URLs as an array. The commented line gives me `undefined` even if I do `Array.from()` on it. – Rohit Aug 21 '19 at 06:49
  • If they resolved as undefined, probably boxes *were* undefined; as to why exactly, I can't suggest without seeing the site. Notice though that you didn't declare boxes (it seems from the sample) – Vaviloff Aug 21 '19 at 08:00
  • good catch, boxes wasn't declared. but what I was saying was that if I used the query selector in line 7 (in the code in my answer) instead of the one in line 6 and do a `console.log()` on the promise instead of returning it (line 2), it was printing out the node list in the browser and also by using `console.log()` before `resolve(boxes)`. So I'm pretty sure that `boxes` wasn't undefined. – Rohit Aug 21 '19 at 08:56
  • Using `resolve(Array.from(boxes))` or wrapping `await page.evaluate` with `Array.from()` didn't seem to convert the node list returned by the query selector into an array – Rohit Aug 21 '19 at 09:01
  • I came to this conclusion when I did `resolve([1,2,3])` like you said and got the exact same array outside the evaluate function, but it couldn't return `boxes` – Rohit Aug 21 '19 at 09:03
  • You can't return DOM nodes, even when passed through Array.from: `If the function passed to the page.evaluate returns a non-Serializable value, then page.evaluate resolves to undefined` [docs](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageevaluatepagefunction-args) Sorry, should have noticed that earlier :) – Vaviloff Aug 21 '19 at 09:06
  • Got it now. Thanks :) – Rohit Aug 21 '19 at 09:31
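
Building on the serializability point in the comments above, one way to structure this is to turn the matched elements into plain strings and objects before they leave the page. The following is a sketch under assumptions: collectVideoLinks is a hypothetical helper, the a#video-title selector is taken from the asker's own answer and may not match YouTube's current markup, and the root parameter exists only so the function can be exercised outside a browser with a stub.

```javascript
// Hypothetical helper: maps matched elements to plain, JSON-friendly objects.
// `root` defaults to the page's `document` when run in a browser context.
function collectVideoLinks(root = document) {
  return Array.from(root.querySelectorAll('a#video-title')).map(link => ({
    title: (link.textContent || '').trim(),
    url: link.href,
  }));
}

// Stub standing in for a DOM, so the helper can run in plain Node.js.
const stubRoot = {
  querySelectorAll: () => [
    { textContent: '  First video ', href: 'https://example.com/watch?v=1' },
    { textContent: 'Second video', href: 'https://example.com/watch?v=2' },
  ],
};

const links = collectVideoLinks(stubRoot);
console.log(JSON.stringify(links));
```

With Puppeteer, passing the function itself, as in `await page.evaluate(collectVideoLinks)`, should run it inside the page with root falling back to document, though that call is not exercised here.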
2
let videoURLs = await page.evaluate(async () => { // scroll till there's no more room to scroll or you get at least 50 boxes
    return await new Promise(resolve => {
        var scrolledHeight = 0
        var distance = 100
        var timer = setInterval(() => {
            var boxes = Array.from(document.querySelectorAll("div.style-scope.ytd-item-section-renderer#contents > ytd-video-renderer > div.style-scope.ytd-video-renderer#dismissable a#video-title")).map(vid => vid.href)
            // boxes = Array.from(document.querySelectorAll("div.style-scope.ytd-item-section-renderer#contents > ytd-video-renderer > div.style-scope.ytd-video-renderer#dismissable"))
            var scrollHeight = document.documentElement.scrollHeight
            window.scrollBy(0, distance)
            scrolledHeight += distance
            if(scrolledHeight >= scrollHeight || boxes.length >= 50){
                clearInterval(timer)
                resolve(boxes)
            }
        }, 500)
    })
})
console.log(videoURLs)
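
The polling pattern used above (repeat a step on a timer, stop once a condition holds, then resolve with the collected data) can be separated from the scrolling details. This is a rough sketch, not the code from this answer: pollUntil and fakeStep are hypothetical names, and the fake step stands in for "scroll and re-query the page".

```javascript
// Hypothetical helper: calls step() every intervalMs until isDone(items)
// returns true, then stops the timer and resolves with the latest items.
function pollUntil(step, isDone, intervalMs = 500) {
  return new Promise(resolve => {
    const timer = setInterval(() => {
      const items = step();
      if (isDone(items)) {
        clearInterval(timer);
        resolve(items);
      }
    }, intervalMs);
  });
}

// Stand-in for "scroll and re-query": each call reveals one more fake URL.
const found = [];
const fakeStep = () => {
  found.push(`https://example.com/watch?v=${found.length + 1}`);
  return found.slice(); // plain strings only, so the result stays serializable
};

const polled = pollUntil(fakeStep, items => items.length >= 3, 10);
polled.then(items => console.log(items.length));
```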
Rohit
1

To pass parameters into page.evaluate and get a result back, here is what you need:

const results = await page.evaluate(new Function('name', "return new Promise(resolve => {resolve('done')});"), name);
Rick
  • This is the only answer here that works for me when I am running Puppeteer inside an Express server. Still wondering why I get `Promise is not defined in evaluation script` when I don't use `new Function(..`. Any idea why? – Less Mar 01 '22 at 21:42
  • For future readers: Figured that out. Babel messed it up. Using backticks did the trick, as suggested here: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md#code-transpilation-issues – Less Mar 01 '22 at 21:59
  • There's no need for `new Function` here. Puppeteer does that for you. You can just pass a string as the function body or a traditional function. – ggorlen Dec 21 '22 at 23:53