I am building a screen scraper that needs to scrape a page's content and take a screenshot of it. I am using Puppeteer for this, but I have hit a snag: when I call a function that runs page.screenshot from inside page.evaluate, I get an error saying the function is not defined.
Here is my code:
async function getContent(clink, ce, networkidle, host, filepath) {
    let browser = await puppeteer.launch();
    let cpage = await browser.newPage();
    await cpage.goto(clink, { waitUntil: networkidle });
    let content = await cpage.evaluate((clink, ce, networkidle, host, filepath, pubDate) => {
        let results = '';
        let enclurl = clink;
        takeScreenshot(enclurl, filepath, networkidle)
            .then(() => {
                console.log("Screenshot taken");
            })
            .catch((err) => {
                console.log("Error occured!");
                console.dir(err);
            });
        results += '<title><![CDATA[' + 'test' + ']]</title>';
        results += '<description><![CDATA[' + '<img src="' + host + filepath.slice(1) + '">' + document.querySelector(ce).innerHTML + ']]</description>';
        results += '<link>' + clink + '</link>';
        results += '<guid>' + clink + '</guid>';
        results += '<pubDate>' + pubDate + '</pubDate>';
        return results;
    }, clink, ce, networkidle, host, filepath, pubDate);
    await cpage.close();
    await browser.close();
    return content;
}
This code should return the items before an RSS-format XML file is created. The URLs of those files will then be added to WPRobot campaigns. The end goal is a search engine that uses WordPress to aggregate the main content of pages along with full screenshots of the sources.
The takeScreenshot function is as follows:
async function takeScreenshot(enclurl, filepath, networkidle) {
    let browser = await puppeteer.launch();
    let page = await browser.newPage();
    await page.goto(enclurl, { waitUntil: networkidle });
    let buffer = await page.screenshot({
        path: filepath
    });
    await page.close();
    await browser.close();
}
takeScreenshot works fine when called outside of page.evaluate. The exact error I get says "takeScreenshot is undefined." I have another function that parses RSS feeds and takes screenshots of their source URLs, but it does not use page.evaluate at all.
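For what it's worth, the function passed to page.evaluate is serialized and executed inside the browser page, not in the Node process, so names defined in the Node script are not in scope there. The sketch below imitates that with Node's built-in vm module instead of a real browser; the vm-based runInFreshContext helper and the simplified takeScreenshot stand-in are assumptions made only for this illustration, not how Puppeteer is implemented internally.

```javascript
const vm = require('node:vm');

// Simplified stand-in for the Node-side helper; it exists only in this script's scope.
async function takeScreenshot(url) {
  return 'screenshot of ' + url;
}

// Stand-in for the callback handed to page.evaluate.
const pageFunction = (url) => takeScreenshot(url);

// Hypothetical helper: turn the callback into source text and run that text
// in a separate global scope, roughly like a function sent to another runtime.
function runInFreshContext(fn, ...args) {
  const source = '(' + fn.toString() + ')(...' + JSON.stringify(args) + ')';
  return vm.runInNewContext(source, {});
}

// Returns a description of what happened when the callback ran "remotely".
function demo() {
  try {
    runInFreshContext(pageFunction, 'https://example.com');
    return 'no error';
  } catch (err) {
    // The error comes from another context, so read its name rather than using instanceof.
    return err.name + ': ' + err.message;
  }
}

console.log(demo());
```

If this picture is right, the screenshot call has to stay on the Node side (for example before or after the page.evaluate call), or the helper has to be made available to the page explicitly.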
I have now moved the call to takeScreenshot to an earlier part of my code, right before getContent() is called, but now getContent() always seems to return undefined. My new getContent() reads:
async function getContent(clink, ce, networkidle) {
    let browser = await puppeteer.launch();
    let cpage = await browser.newPage();
    await cpage.goto(clink, { waitUntil: networkidle });
    let content = await cpage.evaluate((ce) => {
        let cefc = ce.charAt(0);
        if (cefc != '.') {
            ce = '#' + ce;
        }
        console.log('ce=' + ce);
        let results = document.querySelector(ce).innerHTML;
        return results;
    }, ce);
    await cpage.close();
    await browser.close();
    return content;
}
I am also not seeing the output of console.log('ce=' + ce) in the log. After I moved the console.log out of the page.evaluate callback, it logged the expected value for the content, which is the HTML of the element with the specified class. Despite that, the value returned by getContent() remains undefined.
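On the missing log line: a console.log inside page.evaluate writes to the browser page's console, not to the Node terminal. A minimal sketch of relaying those messages through Puppeteer's page.on('console') event is below. The names getContentWithLogs and toSelector are hypothetical, introduced only for this example, and puppeteer is assumed to be installed; this does not by itself explain the undefined return value.

```javascript
// Same prefixing rule as in getContent above, pulled out so it can be checked on its own.
function toSelector(ce) {
  return ce.charAt(0) === '.' ? ce : '#' + ce;
}

async function getContentWithLogs(clink, ce, networkidle) {
  // Loaded lazily so that toSelector can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const cpage = await browser.newPage();
    // Relay messages logged inside the page to the Node console.
    cpage.on('console', (msg) => console.log('[page] ' + msg.text()));
    await cpage.goto(clink, { waitUntil: networkidle });
    const selector = toSelector(ce);
    return await cpage.evaluate((sel) => {
      console.log('ce=' + sel); // runs in the page, forwarded by the listener above
      const el = document.querySelector(sel);
      // Return null instead of throwing when nothing matches the selector.
      return el ? el.innerHTML : null;
    }, selector);
  } finally {
    // Close the browser even if navigation or evaluation throws.
    await browser.close();
  }
}
```

Since the function is async, the caller would still need to await it (or use .then) to get the HTML string rather than a pending promise.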