Since the pages you want to grab data from can be accessed over the internet, it would probably be easiest to achieve what you're looking for with a userscript. Since the URLs you need are already in an array, it's simply a matter of requesting each URL, parsing it, and adding the scraped information to your results array or object.
Here's an example, using the URLs of some random SO questions. Let's say I wanted to get the asker's name for each question. That's available via the selector string #question .user-details > a.
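If you want to sanity-check a selector before wiring it into the script, you can run something like this in the devtools console while viewing one of the target pages (just a quick check using the same selector as above; swap in whatever selector matches the data you're after):

// Quick console check: should log the asker's name on a question page
console.log(document.querySelector('#question .user-details > a')?.textContent);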
Put the URL you want the userscript to run on in the @match metadata section. Due to the same-origin policy, this needs to be on the same domain as the URLs in your array. Because the example URLs I'm using are on https://stackoverflow.com/, the @match also needs to be something on https://stackoverflow.com/.
Put the asynchronous code into an async IIFE so we can use await easily. Then, for each URL: fetch it, turn the response text into a document so its elements can easily be selected with querySelector, select the appropriate element, and push its text to the results array. At the end, console.log the results:
// ==UserScript==
// @name scrape example
// @namespace CertainPerformance
// @version 1
// @match https://stackoverflow.com/questions/51868209/how-to-run-a-function-on-multiple-html-files-and-write-the-output-of-all-executi*
// @grant none
// ==/UserScript==

const urls = [
  'https://stackoverflow.com/questions/313893/how-to-measure-time-taken-by-a-function-to-execute',
  'https://stackoverflow.com/questions/359788/how-to-execute-a-javascript-function-when-i-have-its-name-as-a-string',
  'https://stackoverflow.com/questions/432174/how-to-store-arbitrary-data-for-some-html-tags',
];

(async () => {
  const usernames = [];
  for (const url of urls) {
    // Download the page and parse it into a document we can query
    const response = await fetch(url);
    const responseText = await response.text();
    const responseDocument = new DOMParser().parseFromString(responseText, 'text/html');
    // Select the asker's name and collect it
    const username = responseDocument.querySelector('#question .user-details > a').textContent;
    usernames.push(username);
  }
  console.log(usernames);
})();
To see this in action, install a userscript manager of your choice, such as Tampermonkey, install this script, navigate to the URL in the @match metadata section (the URL of this page: https://stackoverflow.com/questions/51868209/how-to-run-a-function-on-multiple-html-files-and-write-the-output-of-all-executi), and open your console. The three usernames corresponding to those three question URLs should appear after a moment:
["Julius A", "Lightness Races in Orbit", "Community"]
If there are lots of links, you might also consider awaiting another Promise that resolves after, say, a second on each iteration, to avoid hitting a server-side rate limiter, e.g.
await new Promise(res => setTimeout(res, 1000));
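In context, that's just one extra line at the end of the loop body. A sketch of the loop with the delay folded in might look like this (the 1000 ms figure is arbitrary; tune it to whatever the target server will tolerate):

for (const url of urls) {
  const response = await fetch(url);
  const responseText = await response.text();
  const responseDocument = new DOMParser().parseFromString(responseText, 'text/html');
  usernames.push(responseDocument.querySelector('#question .user-details > a').textContent);
  // Pause between requests so the scraping stays polite
  await new Promise(res => setTimeout(res, 1000));
}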
If the amount of data you're scraping is significant, console.logging the results might not be convenient enough. One possible alternative is to put the stringified results into a new textarea, whose raw data can be copied more easily:
const { body } = document;
const textarea = body.insertBefore(document.createElement('textarea'), body.children[0]);
textarea.value = JSON.stringify(usernames);
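As an optional tweak (not part of the original code), you could also pretty-print the JSON and pre-select the textarea's contents so a single Ctrl+C copies everything:

const { body } = document;
const textarea = body.insertBefore(document.createElement('textarea'), body.children[0]);
textarea.rows = 20; // make it big enough to skim
textarea.value = JSON.stringify(usernames, null, 2); // indent for readability
textarea.select(); // highlight the contents so Ctrl+C copies immediately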
If the document is in an unusual encoding, you may need to decode it before handing it to DOMParser, such as with TextDecoder. For example, for a page in windows-1255 encoding, you would await the arrayBuffer() called on the response and then decode it, like this:
for (const url of urls) {
  const response = await fetch(url);
  const responseBuffer = await response.arrayBuffer();
  const responseDecoded = new TextDecoder('windows-1255').decode(responseBuffer);
  const responseDocument = new DOMParser().parseFromString(responseDecoded, 'text/html');
  const username = responseDocument.querySelector('#jobsArr_0').textContent;
  usernames.push(username);
}
When used on the page you posted, this results in:
["אחמש"]
The #jobsArr_0 is just some element on that page that contained Hebrew text; now, the characters aren't mangled anymore.
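If you'd rather not hardcode the encoding, a possible refinement (just a sketch, and it assumes the server actually advertises a charset in its Content-Type header) is to read the charset from the response and fall back to UTF-8:

for (const url of urls) {
  const response = await fetch(url);
  // Pull the charset out of e.g. "text/html; charset=windows-1255", defaulting to UTF-8
  const contentType = response.headers.get('content-type') || '';
  const charset = (/charset=([^;]+)/i.exec(contentType) || [])[1] || 'utf-8';
  const responseBuffer = await response.arrayBuffer();
  const responseDecoded = new TextDecoder(charset).decode(responseBuffer);
  const responseDocument = new DOMParser().parseFromString(responseDecoded, 'text/html');
  usernames.push(responseDocument.querySelector('#jobsArr_0').textContent);
}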