As I mentioned in a comment, the Ctrl+F algorithm may not be as simple as you presume, but you may be able to approximate it by collecting the text contents and values of all visible elements, skipping style, script and metadata tags.
Here's a simple proof of concept:
const puppeteer = require("puppeteer"); // ^19.7.2

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setUserAgent(ua);
  const url = "https://www.google.com";
  await page.goto(url, {waitUntil: "domcontentloaded"});

  // Define a visibility helper in the page so both queries below can use it.
  await page.evaluate(() => {
    window.isVisible = e =>
      // https://stackoverflow.com/a/21696585/6243352
      e.offsetParent !== null &&
      getComputedStyle(e).visibility !== "hidden" &&
      getComputedStyle(e).display !== "none";
  });

  // Tags whose text is never rendered in the page body.
  const excludedTags = [
    "head",
    "link",
    "meta",
    "script",
    "style",
    "title",
  ];

  // Text nodes that are direct children of visible, non-excluded elements.
  const text = await page.$$eval(
    "*",
    (els, excludedTags) =>
      els
        .filter(e =>
          !excludedTags.includes(e.tagName.toLowerCase()) &&
          isVisible(e)
        )
        .flatMap(e => [...e.childNodes])
        .filter(e => e.nodeType === Node.TEXT_NODE)
        .map(e => e.textContent.trim())
        .filter(Boolean),
    excludedTags
  );

  // Values of visible elements that have a value attribute (buttons, inputs).
  // e.value can be a number (<li value="3">) or missing entirely on elements
  // without a value property, so coerce to a string before trimming.
  const values = await page.$$eval("[value]", els =>
    els
      .filter(isVisible)
      .map(e => String(e.value ?? e.getAttribute("value") ?? "").trim())
      .filter(Boolean)
  );

  // Lowercase and dedupe the combined results.
  const visible = [
    ...new Set([...text, ...values].map(e => e.toLowerCase())),
  ];
  console.log(visible);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
[
'about',
'store',
'gmail',
'images',
'sign in',
'businesses and job seekers',
'in your community',
'are growing with help from google',
'advertising',
'business',
'how search works',
'carbon neutral since 2007',
'privacy',
'terms',
'settings',
'google search',
"i'm feeling lucky"
]
Undoubtedly, this has some false positives and negatives, and I've only tested it on google.com. Feel free to post a counterexample and I'll see if I can toss it in.
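One false negative I'd expect: in Chrome, offsetParent is null for position: fixed elements and for body itself, so text in a sticky header or cookie banner would probably be skipped. Here's a rough sketch of a stricter check. isVisibleStrict is a name I made up, and I'm assuming the bundled Chromium supports Element.checkVisibility() (Chrome 105+), with a getClientRects() fallback otherwise. I haven't tried it against real pages.

```javascript
// Hypothetical replacement for window.isVisible; runs in the page context.
function isVisibleStrict(e) {
  // Prefer the built-in check when the browser provides it.
  if (typeof e.checkVisibility === "function") {
    return e.checkVisibility({checkOpacity: true, checkVisibilityCSS: true});
  }

  // Fallback: an element under display: none has no client rects, while a
  // position: fixed element that renders still has at least one.
  return (
    e.getClientRects().length > 0 &&
    getComputedStyle(e).visibility !== "hidden"
  );
}
```

To try it, define it in the page the same way isVisible is defined above, for example inside the page.evaluate callback, and filter with it instead.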
Also, since we run two separate queries, then combine the results and dedupe, the ordering of the text isn't the same as it appears on the page. If this matters, you could run a single query on *, [value] and use conditions to figure out which kind of element you're working with. I've assumed your final goal is just a true/false "does some text exist?" semantic.
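If ordering does matter, here's a rough sketch of the single-query idea, plus the true/false check. collectInOrder, normalize and containsText are names I made up for illustration. Note that the order is only approximate: an element's direct text nodes are emitted when that element is visited, so text that follows a nested child shows up before the child's own text.

```javascript
// Browser-side: meant to be passed to page.$$eval along with excludedTags.
// Puppeteer serializes the function, so it can only use its arguments and
// page globals, not variables from the Node script.
function collectInOrder(els, excludedTags) {
  // Same idea as isVisible in the proof of concept (offsetParent is already
  // null for anything under display: none).
  const isVisible = e =>
    e.offsetParent !== null &&
    getComputedStyle(e).visibility !== "hidden";
  const found = [];

  for (const e of els) {
    if (excludedTags.includes(e.tagName.toLowerCase()) || !isVisible(e)) {
      continue;
    }

    // Condition 1: the element carries a value attribute.
    if (e.hasAttribute("value")) {
      found.push(String(e.value ?? e.getAttribute("value")));
    }

    // Condition 2: the element has text nodes as direct children.
    for (const node of e.childNodes) {
      if (node.nodeType === 3) { // 3 === Node.TEXT_NODE
        found.push(node.textContent);
      }
    }
  }

  // A Set keeps insertion order, so deduping preserves first appearance.
  return [...new Set(found.map(s => s.trim().toLowerCase()).filter(Boolean))];
}

// Node-side: true/false check that ignores case and runs of whitespace.
const normalize = s => s.replace(/\s+/g, " ").trim().toLowerCase();
const containsText = (visible, needle) => {
  const n = normalize(needle);
  return n !== "" && visible.some(e => normalize(e).includes(n));
};
```

In the script above, that would be something like const visible = await page.$$eval("*", collectInOrder, excludedTags), followed by containsText(visible, "feeling lucky"). This only matches text within a single text node or value, so a phrase split across elements won't be found.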