
I'm using Puppeteer.js to crawl some pages. Currently I use jsdom to run a lot of queries against the DOM, because doing this with Puppeteer alone isn't possible, as you can see in the following code and the explanation below:

const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    // Navigate the blank page to a URL
    await page.goto('https://www.google.com/');
    const result = await page.evaluate(() => {});
    await browser.close();
})();

Inside the `page.evaluate` callback I have the `document` object available. I want to work with it outside of this scope, since my DOM investigation is long and I don't want to clutter the Puppeteer code with it, but I can't call any outer function from inside the callback.

const result = await page.evaluate(() => {});
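For what it's worth, `page.evaluate` does accept a function reference defined elsewhere, as long as that function is self-contained: it gets serialized and re-run in the browser, so it cannot close over Node variables or call other Node functions. A hedged sketch, where `investigateDom` and the file name are made up:

```javascript
// investigate.js - hypothetical module holding the DOM investigation.
// This function is serialized and executed in the browser context, so it
// may only use browser globals like `document`; it cannot reference any
// Node.js variables or require()'d modules.
function investigateDom() {
  const divs = document.querySelectorAll('div');
  return { divCount: divs.length, title: document.title };
}

module.exports = { investigateDom };

// In the Puppeteer file:
//   const { investigateDom } = require('./investigate');
//   const result = await page.evaluate(investigateDom);
```

This keeps the investigation out of the Puppeteer file, but only the serializable return value comes back to Node, not live DOM nodes.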

Today, instead of this, I have to export the whole HTML page and feed it to an external package like jsdom in another file:

const jsdom = require('jsdom');
const pageContent = await page.content();
const dom = new jsdom.JSDOM(pageContent);
const divs = dom.window.document.querySelectorAll('div');
// Rest of the long investigation here.
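In the meantime, one way to keep the Puppeteer file clean is to hide the whole jsdom step behind a small helper module. A minimal sketch, assuming jsdom is installed; `buildDom`, `investigate`, and the file name are names I made up:

```javascript
// dom-investigation.js - hypothetical helper module. It turns the
// serialized page HTML into a jsdom document so the long investigation
// lives outside the Puppeteer code.
function buildDom(pageContent) {
  // Required lazily, so loading this module doesn't need jsdom.
  const { JSDOM } = require('jsdom');
  return new JSDOM(pageContent);
}

function investigate(document) {
  // Rest of the long investigation here.
  const divs = document.querySelectorAll('div');
  return { divCount: divs.length };
}

module.exports = { buildDom, investigate };

// In the Puppeteer file:
//   const { buildDom, investigate } = require('./dom-investigation');
//   const dom = buildDom(await page.content());
//   const result = investigate(dom.window.document);
```

Because `investigate` takes the document as a parameter, it also works unchanged if you later swap jsdom for something else that exposes `querySelectorAll`.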

What I need is something like this:

const dom = await page.contentDOM();
const divs = dom.window.document.querySelectorAll('div');
// Rest of the long investigation here.

So basically, my question is: is there some built-in function in Puppeteer.js that exposes the root `document` object outside of `page.evaluate`, so I can pass the document to another function and investigate the DOM there, or do I still need an external package like jsdom at this point?

I've tried to look for answers here:
Getting DOM node text with Puppeteer and headless Chrome
Using puppeteer how do you get all child nodes of a node?
Getting all styles with devtool-protocol in puppeteer
Get DocType of an HTML as string with Javascript
Handling events from puppeteer's page context outside evaluate method
Headless Chrome ( Puppeteer ) - how to get access to document node element?

I recently opened an issue on the Puppeteer.js GitHub page, but with no answer so far:
https://github.com/puppeteer/puppeteer/issues/6667

Thanks in advance.

Or Assayag

1 Answer


I'm not an expert at scraping or anything like that, so I could be wrong, but I think Puppeteer can do anything that jsdom can do, plus execute JavaScript. I have found these four functions really helpful:

page.$$: runs `document.querySelectorAll` in the page and returns an array of `ElementHandle`s usable in Node

page.$: runs `document.querySelector` in the page and returns an `ElementHandle` usable in Node

page.$$eval: runs your callback on the result of `querySelectorAll` and returns the callback's result

page.$eval: runs your callback on the result of `querySelector` and returns the callback's result
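For example, with `page.$$eval` the query runs inside the page and only plain serializable data comes back to Node, so no document object ever needs to leave the browser. A sketch; `collectDivs` is a hypothetical helper name:

```javascript
// Hypothetical helper: runs the div investigation inside the page via
// page.$$eval and returns only JSON-serializable data to Node.
async function collectDivs(page) {
  // The callback executes in the browser on the array produced by
  // document.querySelectorAll('div'); only its return value crosses
  // back into Node.
  return page.$$eval('div', divs =>
    divs.map(div => ({ id: div.id, text: div.textContent.trim() }))
  );
}

// Usage, inside the async IIFE from the question:
//   const divs = await collectDivs(page);
//   console.log(divs.length);
```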

If you want the HTML from the page, I found this answer:

const renderedContent = await page.evaluate(() => new XMLSerializer().serializeToString(document));
PhantomSpooks
  • Sadly, this is not what I'm looking for. I want to get the `document` object with all of its functions, like `getElementById` and `querySelectorAll`. This only gives me the HTML string. By the way, you can get the same thing with `await page.content()`. – Or Assayag Dec 24 '20 at 18:41
  • Have you considered using page.exposeFunction to load your functions into the page so that when you call page.evaluate it would be easier to read? Here's the doc for page.exposeFunction https://pptr.dev/#?product=Puppeteer&version=v5.5.0&show=api-pageexposefunctionname-puppeteerfunction – PhantomSpooks Dec 27 '20 at 12:02
  • Even if this works, how does it answer my needs? I need an object like the one I get from the jsdom package - a root document with all the `querySelectorAll` and `getElementById` kinds of functions - outside the `page.evaluate` function. – Or Assayag Dec 27 '20 at 15:26
  • I thought the reason you were using jsdom was because you had a lot of DOM queries that you did not want to nest in the `page.evaluate`. Have you considered just forgetting Puppeteer and using `JSDOM.fromURL`? – PhantomSpooks Dec 28 '20 at 10:27
  • I wish to combine all in one package. I need to perform crawling in specific parts, and perform a long DOM investigation in the other parts, but in the same URL. – Or Assayag Dec 28 '20 at 10:38