I'm using Puppeteer to crawl some pages. Currently, I use jsdom to run a lot of queries on the DOM, because I can't find a way to do this with Puppeteer alone, as the following code and the comment after it show:
const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    // Instructs the blank page to navigate to a URL
    await page.goto('https://www.google.com/');
    const result = await page.evaluate(() => {});
    await browser.close();
})();
Inside the page.evaluate callback, the 'document' object is available. I want to work with it outside of this scope, because I have a lot of DOM investigation to do and I don't want to clutter the Puppeteer code with it. I can't call any outer function from inside the callback.
const result = await page.evaluate(() => {});
Today, instead, I have to export the whole HTML of the page and pass it to an external package such as jsdom in another file:
const jsdom = require('jsdom');
const pageContent = await page.content();
const dom = new jsdom.JSDOM(pageContent);
const divs = dom.window.document.querySelectorAll('div');
// Rest of the long investigation here.
What I would like to have instead (page.contentDOM() is a made-up method, just to illustrate the idea):
const dom = await page.contentDOM();
const divs = dom.window.document.querySelectorAll('div');
// Rest of the long investigation here.
So basically, my question is: is there a built-in function in Puppeteer that exposes the document root object outside of page.evaluate, so that I can pass the document to another function and investigate the DOM there? Or do I still need to use external packages like jsdom at this point?
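For context, the closest thing I'm aware of is sketched below; it is not a confirmed answer to the question. As far as I understand, page.evaluate accepts a function reference plus extra arguments, serializes the function, and runs it in the browser. So the investigation code can live in a separate function (or file), as long as it relies only on browser globals and its own arguments, and returns plain serializable data rather than DOM nodes. The names investigateDom and crawl are hypothetical, and the fields collected are only an example.

```javascript
// Hypothetical helper: this could live in its own file and be exported.
// It runs inside the browser, so it may only use browser globals such as
// `document` plus the arguments passed to it (no outer-scope variables).
function investigateDom(selector) {
  const nodes = Array.from(document.querySelectorAll(selector));
  // Return plain, serializable data; DOM nodes cannot cross back to Node.js.
  return nodes.map((node) => ({
    id: node.id,
    className: node.className,
    text: (node.textContent || '').trim().slice(0, 80),
  }));
}

// Hypothetical wrapper around the Puppeteer calls.
async function crawl(url) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when crawling
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Pass the function reference; extra arguments are forwarded to it.
    return await page.evaluate(investigateDom, 'div');
  } finally {
    await browser.close();
  }
}

// Usage: crawl('https://www.google.com/').then(console.log);
```

This keeps the Puppeteer code short, but the document itself still never leaves the browser context, which is why I'm asking whether something better exists.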
I tried to look for answers in these questions:
Getting DOM node text with Puppeteer and headless Chrome
Using puppeteer how do you get all child nodes of a node?
Getting all styles with devtool-protocol in puppeteer
Get DocType of an HTML as string with Javascript
Handling events from puppeteer's page context outside evaluate method
Headless Chrome ( Puppeteer ) - how to get access to document node element?
I also recently opened an issue on the Puppeteer GitHub page, but it has not received an answer:
https://github.com/puppeteer/puppeteer/issues/6667
Thanks in advance.