
I am creating a screen scraper that needs to scrape the content of a page and take a screenshot of it. I am using Puppeteer for that, but I have hit a snag. When I call a function that runs page.screenshot from inside page.evaluate, I get an error saying that the function is not defined.

Here is my code:

async function getContent(clink, ce, networkidle, host, filepath) {
        let browser = await puppeteer.launch();
        let cpage = await browser.newPage();
        await cpage.goto(clink, { waitUntil: networkidle });
        let content = await cpage.evaluate((clink, ce, networkidle, host, filepath, pubDate) => {
            let results = '';
            let enclurl = clink;
            takeScreenshot(enclurl, filepath, networkidle)
                .then(() => {
                    console.log("Screenshot taken");
                })
                .catch((err) => {
                    console.log("Error occured!");
                    console.dir(err);
                });
            results += '<title><![CDATA[' + 'test' + ']]</title>';
            results += '<description><![CDATA[' + '<img src="' + host + filepath.slice(1) + '">' + document.querySelector(ce).innerHTML + ']]</description>';
            results += '<link>' + clink + '</link>';
            results += '<guid>' + clink + '</guid>';
            results += '<pubDate>' + pubDate + '</pubDate>';
            return results;
        }, clink, ce, networkidle, host, filepath, pubDate);
        await cpage.close();
        await browser.close();
        return content;
    }

That code should return the items that are later written to an RSS-format XML file. The URLs of those files will then be added to WPRobot campaigns. The end goal is a search engine that uses WordPress to aggregate the main content of pages together with full screenshots of the sources.

The takeScreenshot function is as follows:

async function takeScreenshot(enclurl, filepath, networkidle) {
        let browser = await puppeteer.launch();
        let page = await browser.newPage();
        await page.goto(enclurl, { waitUntil: networkidle });
        let buffer = await page.screenshot({
            path: filepath
        });

        await page.close();
        await browser.close();
    }

takeScreenshot works just fine when called outside of page.evaluate. The exact error I get says "takeScreenshot is undefined." I have another function that parses RSS feeds and takes screenshots of their source URLs, but it does not use page.evaluate at all.

I have now moved the call to takeScreenshot to an earlier part of my code, right before getContent() is called, but now getContent() always seems to return undefined. My new getContent() reads:

 async function getContent(clink, ce, networkidle) {
        let browser = await puppeteer.launch();
        let cpage = await browser.newPage();
        await cpage.goto(clink, { waitUntil: networkidle });
        let content = await cpage.evaluate((ce) => {
            let cefc = ce.charAt(0);
            if (cefc != '.') {
                ce = '#' + ce;
            }
            console.log('ce=' + ce);
            let results = document.querySelector(ce).innerHTML;
            return results;
        }, ce);
        await cpage.close();
        await browser.close();
        return content;
    }

I am also not seeing console.log('ce=' + ce) written to the log. After I moved the console.log out of the page.evaluate callback, it logged the expected value for content, which is the HTML of the element with the specified class. Despite that, the value returned by getContent() remains undefined.

  • Maybe [`page.exposeFunction()`](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pageexposefunctionname-puppeteerfunction) can help in your case. – vsemozhebuty Mar 13 '21 at 13:44

1 Answer


page.evaluate works in a way that is not intuitive at first:

the function you pass to it (in your case: (clink, ce, networkidle, host, filepath, pubDate) => {...}) is NOT executed in your script. The function is serialized and sent to the headless browser that Puppeteer controls, where it runs in the context of the page. Nothing defined in your Node.js script, including takeScreenshot, exists there.
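To make the serialization point concrete, here is a minimal sketch that imitates it with Node's built-in vm module. This is only an analogy and not how Puppeteer is implemented: the callback is turned into source text and run in a fresh context, which stands in for the page. The helper name runInFreshContext is made up for the illustration.

```javascript
const vm = require('vm');

// Exists only in the Node.js script, like takeScreenshot in the question.
function takeScreenshot() {
    return 'screenshot taken';
}

// The callback reaches takeScreenshot through its enclosing scope.
const callback = () => takeScreenshot();

// Step 1: the callback is reduced to its source text. The scope is not included.
const source = callback.toString();

// Step 2: the text is run in a separate context that has no takeScreenshot.
// (Hypothetical helper; the fresh vm context plays the role of the page.)
function runInFreshContext(src) {
    try {
        return vm.runInNewContext('(' + src + ')()');
    } catch (err) {
        return err.name + ': ' + err.message;
    }
}

const localResult = callback();                 // works: same scope
const remoteResult = runInFreshContext(source); // fails: a ReferenceError

console.log(localResult);
console.log(remoteResult);
```

The second call fails for the same reason the call inside page.evaluate does: only the text of the function travels, not the variables around it.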

If you want to call a function from inside the evaluate callback, you can usually (but not in this case) use one of these tricks: How to pass a function in Puppeteers .evaluate() method?

BUT in this case there is a problem: takeScreenshot itself calls other functions that CAN'T run inside the headless browser, such as puppeteer.launch(). These functions require a lot of dependencies (and an executable) and can't be passed to the page.

To do what you need, move the screenshot part of your code out of evaluate:

async function getContent(clink, ce, networkidle, host, filepath) {
    let browser = await puppeteer.launch();
    let cpage = await browser.newPage();
    await cpage.goto(clink, { waitUntil: networkidle });
    // This callback runs inside the page: only the DOM and the passed arguments exist here.
    // pubDate is assumed to be defined in an enclosing scope, as in the question.
    let content = await cpage.evaluate((clink, ce, networkidle, host, filepath, pubDate) => {
        let results = '';

        // CDATA sections are closed with ']]>'.
        results += '<title><![CDATA[' + 'test' + ']]></title>';
        results += '<description><![CDATA[' + '<img src="' + host + '{REPL_ME}' + '">' + document.querySelector(ce).innerHTML + ']]></description>';
        results += '<link>' + clink + '</link>';
        results += '<guid>' + clink + '</guid>';
        results += '<pubDate>' + pubDate + '</pubDate>';
        return results;
    }, clink, ce, networkidle, host, filepath, pubDate);

    // This runs in Node.js, where takeScreenshot and puppeteer are available.
    await takeScreenshot(clink, filepath, networkidle);
    content = content.replace('{REPL_ME}', filepath);

    await cpage.close();
    await browser.close();
    return content;
}
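As a variation on the code above, here is a rough sketch that takes the screenshot on the page that is already open, so no second browser is launched. It assumes the screenshot and the scraped content come from the same URL, which is how the question uses them. The function name getContentAndScreenshot is made up, and puppeteer is passed in as a parameter only so the sketch does not depend on how it is imported. The fullPage option is my own choice based on the "full screenshots" goal in the question.

```javascript
// Hypothetical helper: scrape one element and screenshot the same page.
async function getContentAndScreenshot(puppeteer, clink, ce, networkidle, filepath) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();

        // console.log calls made inside the page are not printed by Node.js.
        // Forwarding the 'console' event makes them visible while debugging.
        page.on('console', (msg) => console.log('page:', msg.text()));

        await page.goto(clink, { waitUntil: networkidle });

        // Runs in Node.js, so the Puppeteer API can be used directly.
        await page.screenshot({ path: filepath, fullPage: true });

        // The callback runs in the page; only its return value comes back.
        // $eval rejects if no element matches the selector.
        const html = await page.$eval(ce, (el) => el.innerHTML);
        return html;
    } finally {
        // Close the browser even if navigation or the selector fails.
        await browser.close();
    }
}
```

A call could look like getContentAndScreenshot(require('puppeteer'), 'https://example.com', '.main', 'networkidle2', './shot.png'), where the URL, selector and path are placeholders.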
  • Thanks, I've been doing that, so now I call takeScreenshot before getContent in my main function and just use the file path for the image URL in the RSS items. – PostAlmostAnything Mar 13 '21 at 03:41
  • Now I am working on a new question involving the use of querySelector() when the class attribute has more than one class such as class="text-left p-2". So far my efforts to select elements with classes like that are coming back as null. – PostAlmostAnything Mar 13 '21 at 03:42
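  • I don't speak English very well. If you want to select, for example, an element with class c1 and another element with class c2, the selector is ".c1, .c2". If you want to select a single element that has both classes (class="c1 c2"), the selector is ".c1.c2". Sorry if I did not understand! – Massimo Rebuglio Mar 13 '21 at 03:51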
  • Seems I spoke too soon about my solution working. Now the screenshots are taken just fine but the results of getContent() are always undefined no matter what the class value is. I am about to update my question with the new version of getContent() – PostAlmostAnything Mar 13 '21 at 04:56
  • You don't see the console.log output for the same reason: you are logging in the headless browser, not in your console. To debug, try returning ce and printing it outside of the evaluate function. – Massimo Rebuglio Mar 13 '21 at 05:09
  • Inside the evaluate function: let results = ce; return results; and after it: console.log(content) – Massimo Rebuglio Mar 13 '21 at 05:13
  • console.log(content) logs the appropriate class, but even though let result = ce and return result has been added what gets returned says "undefined" when it should return the name of the class that was logged as content – PostAlmostAnything Mar 13 '21 at 05:18
  • The code looks right. To debug a case like this I usually try my function in my browser console. – Massimo Rebuglio Mar 13 '21 at 05:21
  • PS: Take a look at Puppeteer's $eval function; it can make your work easier. – Massimo Rebuglio Mar 13 '21 at 05:23
  • Could it be due to the HTML containing script tags? Some sort of XSS protection in Puppeteer that I don't know about. The page I am scraping has several blocks of third party ad code in it. – PostAlmostAnything Mar 13 '21 at 05:39
  • Not due to XSS, I changed the name of the element to something without a script tag, same problem. Then I tried changing the name of the variable since it looked like there was a variable by that name in what I thought was a different scope elsewhere in my code and I got the error "Assignment to constant variable" when trying to set pcontent = "test content" right before returning pcontent. – PostAlmostAnything Mar 13 '21 at 05:59
  • Then I tried changing return pcontent to return 'test content' and it still returns as undefined. – PostAlmostAnything Mar 13 '21 at 06:09
  • I tried declaring maincontent differently as let maincontent = '' then setting maincontent = await getContent but that changed nothing, so whatever is going on is sufficient to change the value of maincontent from an empty string to undefined but not sufficient to return the value of the pcontent variable as the value of maincontent after calling getContent – PostAlmostAnything Mar 13 '21 at 06:16
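On the multi-class selector question raised in the comments: a value such as class="text-left p-2" cannot be passed to querySelector as is, because the space would be read as a descendant combinator. A small sketch of a helper that builds the ".c1.c2" form mentioned above follows. The name toSelector is made up, and treating a single bare name as an id is an assumption copied from the question's own code, which prefixes '#' when the value does not start with '.'.

```javascript
// Hypothetical helper: turn the `ce` value into a CSS selector string.
function toSelector(ce) {
    const trimmed = ce.trim();

    // Already written as a class or id selector: leave it alone.
    if (trimmed.startsWith('.') || trimmed.startsWith('#')) {
        return trimmed;
    }

    const names = trimmed.split(/\s+/);

    // A single bare name is treated as an id, as in the question's code.
    if (names.length === 1) {
        return '#' + names[0];
    }

    // Several names are treated as classes that must all be on one element.
    return '.' + names.join('.');
}

console.log(toSelector('text-left p-2'));
```

The result can then be passed as ce to page.evaluate or page.$eval in place of the raw attribute value.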