2

I am working with Node.js and the Puppeteer library to load a website and check whether a certain text is displayed on the page. I would like to count the number of occurrences of this text. Specifically, I would like the search to work exactly the way Ctrl+F works in Chrome or Firefox.

Here's the code I have so far:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // How do I count the occurrences of the specific text here?

  await browser.close();
})();

Can someone please help me with a solution on how to achieve this? Any help would be greatly appreciated.

ggorlen
Caesar

3 Answers

2
import puppeteer from 'puppeteer'

(async () => {
  const textToFind = 'domain'
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://example.com')

  const text = await page.evaluate(() => document.documentElement.innerText)

  const n = [...text.matchAll(new RegExp(textToFind, 'gi'))].length
  console.log(`${textToFind} appears ${n} times`)

  await browser.close()
})()
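
One caveat with building a `RegExp` directly from the search text is that characters such as `.` or `$` are treated as regex syntax rather than literal text. The sketch below shows one way to count literal, case-insensitive matches. The names `escapeRegExp` and `countOccurrences` are illustrative helpers, not part of Puppeteer, and this is not claimed to reproduce how Ctrl+F counts matches.

```javascript
// Hypothetical helper: escape regex metacharacters so the search
// text is matched literally instead of being parsed as a pattern.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: count case-insensitive, non-overlapping
// literal occurrences of `needle` in `text`.
function countOccurrences(text, needle) {
  if (!needle) return 0; // an empty pattern would match at every position
  const re = new RegExp(escapeRegExp(needle), "gi");
  return [...text.matchAll(re)].length;
}

console.log(
  countOccurrences(
    "Example Domain. This domain is for use in illustrative examples.",
    "domain"
  )
); // 2
console.log(countOccurrences("Total: $5.00 (was $5000)", "$5.00")); // 1
```

With the code above, this would be called as `countOccurrences(text, textToFind)` once `text` has been read from the page.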
Eric Fortis
  • The problem with this solution is that it seems to skip over elements like buttons, which Ctrl+F in a browser would pick up. For example, if I try this on `google.com`, the text I get back doesn't include "Google Search" or "I am feeling lucky". This is what I get back locally: `About\nStore\nGmailImages\nSign in\n \nGoogle offered in: Français\nCanada\nAdvertising\nBusiness\nHow Search works\nPrivacy\nTerms\nSettings` – Caesar Apr 20 '23 at 04:30
  • @Caesar Have you tried `innerHTML`, which Puppeteer can provide easily with `await page.content()`? Those texts are `value=""` attributes. I doubt you'll get it _exactly_ like the Ctrl+F algorithm, because that probably has special sauce that can't be naively replicated without knowing the internals. Can you explain why it's so important that it's exactly like Ctrl+F? What's your [actual use case](https://meta.stackexchange.com/a/233676/399876)? – ggorlen Apr 20 '23 at 04:48
  • @ggorlen The problem with `innerHTML` is that it includes a lot of things that are not necessarily displayed on the page. In our case, we are trying to validate that a site is up and running and that certain texts are showing up on the page. The clients are usually non-technical, and they expect something like Ctrl+F searching. For example, they might want to check that "Google search" exists on the page but "Error" does not. It's very likely that "Error" will show up in the `innerHTML` because some script has that name. – Caesar Apr 20 '23 at 04:56
  • Got it, that makes sense. It's probably not easy to isolate the user-visible attributes from the HTML, but a rough approximation might be to add values and placeholders to the text contents and use that as the "visible text content". As an example of the complexity, Ctrl+F doesn't include things that are in the HTML but hidden from view (e.g. `visibility: hidden`). How would you capture that in Puppeteer? It's not obvious. The algorithm could take something like 50 lines of conditions to cover all the edge cases. I think more specification is necessary. – ggorlen Apr 20 '23 at 05:03
1

As I mentioned in a comment, the Ctrl+F algorithm may not be as simple as you presume, but you may be able to approximate it by collecting the values and text contents of all visible elements, excluding style, script, and metadata elements.

Here's a simple proof of concept:

const puppeteer = require("puppeteer"); // ^19.7.2

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setUserAgent(ua);
  const url = "https://www.google.com";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.evaluate(() =>
    window.isVisible = e =>
      // https://stackoverflow.com/a/21696585/6243352
      e.offsetParent !== null &&
      getComputedStyle(e).visibility !== "hidden" &&
      getComputedStyle(e).display !== "none"
  );
  const excludedTags = [
    "head",
    "link",
    "meta",
    "script",
    "style",
    "title",
  ];
  const text = await page.$$eval(
    "*",
    (els, excludedTags) =>
      els
        .filter(e =>
          !excludedTags.includes(e.tagName.toLowerCase()) &&
          isVisible(e)
        )
        .flatMap(e => [...e.childNodes])
        .filter(e => e.nodeType === Node.TEXT_NODE)
        .map(e => e.textContent.trim())
        .filter(Boolean),
    excludedTags
  );
  const values = await page.$$eval("[value]", els =>
    els
      .filter(isVisible)
      .map(e => e.value.trim())
      .filter(Boolean)
  );
  const visible = [
    ...new Set([...text, ...values].map(e => e.toLowerCase())),
  ];
  console.log(visible);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

[
  'about',
  'store',
  'gmail',
  'images',
  'sign in',
  'businesses and job seekers',
  'in your community',
  'are growing with help from google',
  'advertising',
  'business',
  'how search works',
  'carbon neutral since 2007',
  'privacy',
  'terms',
  'settings',
  'google search',
  "i'm feeling lucky"
]
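
Given a list of lowercased strings like the one above, a check for phrases that must appear and phrases that must not appear could look roughly like this. `checkPhrases` and its `required`/`forbidden` options are hypothetical names chosen for illustration, and the sample array is a shortened copy of the output above.

```javascript
// Hypothetical helper: `visible` is an array of lowercased strings,
// one per text node or form value, as produced by the script above.
function checkPhrases(visible, { required = [], forbidden = [] } = {}) {
  // A phrase counts as present if any single string contains it.
  const has = phrase => visible.some(s => s.includes(phrase.toLowerCase()));
  return {
    missing: required.filter(p => !has(p)), // required but not found
    unexpected: forbidden.filter(p => has(p)), // forbidden but found
  };
}

const sample = ["about", "store", "google search", "i'm feeling lucky"];
console.log(
  checkPhrases(sample, { required: ["Google Search"], forbidden: ["Error"] })
); // { missing: [], unexpected: [] }
```

Because each string comes from a single text node or value, a phrase that is split across several elements on the page would not be found by this check.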

Undoubtedly, this has some false positives and negatives, and I've only tested it on google.com. Feel free to post a counterexample and I'll see if I can toss it in.

Also, because this runs two separate queries, then combines and dedupes the results, the text isn't in the same order as it appears on the page. If that matters, you could query by *, [value] in a single pass and use conditions to figure out which kind of match you're working with. I've assumed your final goal is just a true/false "does some text exist?" check.
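
As a rough sketch of the single-pass idea, the function below walks a DOM subtree recursively and emits values and text-node contents in document order. `collectInOrder`, `fakeEl`, and `fakeText` are hypothetical names. The fake nodes are plain-object stand-ins that only exist so the sketch runs under Node without a browser, and the visibility test is passed in as a parameter for the same reason.

```javascript
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers

// Walk `el` depth-first, pushing form values and trimmed text-node
// contents onto `out` in the order they occur in the tree.
function collectInOrder(el, isVisible, excludedTags, out = []) {
  // Skip excluded elements together with everything inside them.
  if (excludedTags.includes(el.tagName.toLowerCase())) return out;
  const visible = isVisible(el);
  if (visible && el.hasAttribute("value") && typeof el.value === "string") {
    const v = el.value.trim();
    if (v) out.push(v);
  }
  for (const child of el.childNodes) {
    if (child.nodeType === ELEMENT_NODE) {
      // Recurse even when `el` is hidden, so each element is still
      // judged by its own visibility, as in the two-query version.
      collectInOrder(child, isVisible, excludedTags, out);
    } else if (visible && child.nodeType === TEXT_NODE) {
      const t = child.textContent.trim();
      if (t) out.push(t);
    }
  }
  return out;
}

// Plain-object stand-ins for DOM nodes (assumption: demo only).
const fakeText = s => ({ nodeType: TEXT_NODE, textContent: s });
const fakeEl = (tagName, childNodes = [], attrs = {}) => ({
  nodeType: ELEMENT_NODE,
  tagName,
  childNodes,
  value: attrs.value,
  hasAttribute: name => name in attrs,
});

const body = fakeEl("BODY", [
  fakeEl("P", [fakeText("Hello "), fakeEl("B", [fakeText("bold")]), fakeText(" world")]),
  fakeEl("INPUT", [], { value: "Google Search" }),
  fakeEl("SCRIPT", [fakeText("var Error = 1;")]),
]);
console.log(collectInOrder(body, () => true, ["script", "style"]));
// [ 'Hello', 'bold', 'world', 'Google Search' ]
```

In Puppeteer, the function would need to be defined in the page context, for example inside a `page.evaluate` callback or installed on `window` the way `isVisible` is above, because functions cannot be passed to the page as serialized arguments. It would then be called on `document.body` with the real `isVisible`.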

ggorlen
0

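You can get all of the page's text and then run a regex or a simple search:

// '*' matches the first element in the document, i.e. the root <html> element
const extractedText = await page.$eval('*', (el) => el.innerText);
console.log(extractedText);
// Replace '--search word--' with the text to look for
const regx = new RegExp('--search word--', 'g');
const count = (extractedText.match(regx) || []).length;
console.log(count);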
  • The problem with this solution is that it seems to skip over elements like buttons, which Ctrl+F in a browser would pick up. For example, if I try this on `google.com`, the text I get back doesn't include "Google Search" or "I am feeling lucky". This is what I get back locally: `About\nStore\nGmailImages\nSign in\n \nGoogle offered in: Français\nCanada\nAdvertising\nBusiness\nHow Search works\nPrivacy\nTerms\nSettings` – Caesar Apr 20 '23 at 04:44