
I need to scrape data from multiple HTML pages which I have downloaded to my computer.

All pages are built the same way, meaning that the data I need to scrape has the same CSS classes on all the pages.

I could open each page manually, open the Chrome console, and paste in a function to scrape the info (i.e., select the info that matches the specified class, id, etc. and save it to a variable), but that obviously wouldn't be efficient. How do I tell the computer to open each file, execute the command from the Chrome console, and then save the output somewhere, so that when I open that file the output of all the executions is there? If writing it all into a file is a hassle, putting everything into an array/object which I could copy is fine too.
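
For illustration, the kind of function I paste into the console looks something like this (the class name here is just a placeholder for whatever the real pages use):

// Collect the text of every element with a given class into an array
const results = [...document.querySelectorAll('.some-class')]
  .map(el => el.textContent.trim());
console.log(results);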

Edit: I can also access the pages on the internet and have all the URLs stored in an array.

Leroy qt

1 Answer


Since the pages you want to grab data from can be accessed over the internet, it would probably be easiest to achieve what you're looking for with a userscript. Since the URLs you need are already in an array, it's simply a matter of requesting each URL, parsing it, and adding the scraped information to your results array or object.

Here's an example, using the URLs of some random SO questions. Let's say I wanted to get the asker's name for each question. This is available via the selector string #question .user-details > a.
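
To sanity-check a selector before writing the script, you can run it manually in the console on one of the pages; here it logs the asker's name:

// Logs "Julius A" on the first example question below
console.log(document.querySelector('#question .user-details > a').textContent);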

Put the URL you want the userscript to run on in the @match metadata section. Due to the same-origin policy, this needs to be on the same domain as the URLs in your array. Because the example URLs I'm using are on https://stackoverflow.com/, the @match also needs to be something on https://stackoverflow.com/.

Put the asynchronous code into an async IIFE so we can use await easily, and then, for each URL: fetch it, transform the response text into a document so its elements can be queried with querySelector, select the appropriate element, and push it to the results array. At the end, console.log the results:

// ==UserScript==
// @name         scrape example
// @namespace    CertainPerformance
// @version      1
// @match        https://stackoverflow.com/questions/51868209/how-to-run-a-function-on-multiple-html-files-and-write-the-output-of-all-executi*
// @grant        none
// ==/UserScript==

const urls = [
  'https://stackoverflow.com/questions/313893/how-to-measure-time-taken-by-a-function-to-execute',
  'https://stackoverflow.com/questions/359788/how-to-execute-a-javascript-function-when-i-have-its-name-as-a-string',
  'https://stackoverflow.com/questions/432174/how-to-store-arbitrary-data-for-some-html-tags',
];

(async () => {
  const usernames = [];
  for (const url of urls) {
    // Fetch each page and read its HTML as text
    const response = await fetch(url);
    const responseText = await response.text();
    // Parse the HTML string into a document we can query
    const responseDocument = new DOMParser().parseFromString(responseText, 'text/html');
    // Select the asker's name and add it to the results
    const username = responseDocument.querySelector('#question .user-details > a').textContent;
    usernames.push(username);
  }
  console.log(usernames);
})();

To see this in action, install a userscript manager of your choice (such as Tampermonkey), install this script, navigate to the URL in the @match metadata section (that is, the URL of this page:

https://stackoverflow.com/questions/51868209/how-to-run-a-function-on-multiple-html-files-and-write-the-output-of-all-executi

), and open your console. The three usernames corresponding to those three question URLs should appear after a moment:

   ["Julius A", "Lightness Races in Orbit", "Community"]

If there are lots of links, you might also consider awaiting another Promise that resolves after, say, 5 seconds on each iteration, to avoid hitting a server-side rate limiter, e.g.

await new Promise(res => setTimeout(res, 5000));
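
In context, the delay goes at the end of each loop iteration, something like this (a sketch of the loop from the script above):

for (const url of urls) {
  const response = await fetch(url);
  // ... parse the response and push the result, as before ...
  // Pause for 5 seconds before the next request to stay under rate limits
  await new Promise(res => setTimeout(res, 5000));
}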

If the amount of data you're scraping is significant, console.logging the results might not be accessible enough. One possible alternative is to put the stringified results into a new textarea, from which the raw data can be copied more easily:

const { body } = document;
const textarea = body.insertBefore(document.createElement('textarea'), body.children[0]);
textarea.value = JSON.stringify(usernames);
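
If you'd rather end up with an actual file, one possible alternative (a standard browser technique; the filename here is arbitrary) is to trigger a download of the stringified results via a Blob:

const blob = new Blob([JSON.stringify(usernames)], { type: 'application/json' });
const link = document.createElement('a');
link.href = URL.createObjectURL(blob);
link.download = 'results.json'; // hypothetical filename
link.click();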

If the document is in an odd encoding, you may need to decode it before handing it to DOMParser, for example with TextDecoder. For a page in windows-1255 encoding, you would call arrayBuffer() on the response, await it, and then decode the result, like this:

for (const url of urls) {
  const response = await fetch(url);
  // Read the raw bytes rather than text so we can decode them ourselves
  const responseBuffer = await response.arrayBuffer();
  const responseDecoded = new TextDecoder('windows-1255').decode(responseBuffer);
  const responseDocument = new DOMParser().parseFromString(responseDecoded, 'text/html');
  const username = responseDocument.querySelector('#jobsArr_0').textContent;
  usernames.push(username);
}

When used on the page you posted, this results in:

   ["אחמש"]

The #jobsArr_0 selector is just some element that contained Hebrew text; the characters are no longer mangled.
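
If you don't know the encoding ahead of time, one option is to read it from the response's Content-Type header before decoding. This is just a sketch: it assumes the server actually declares a charset, and falls back to UTF-8 otherwise:

const contentType = response.headers.get('Content-Type') || '';
// Use the declared charset if present, otherwise assume UTF-8
const charsetMatch = contentType.match(/charset=([\w-]+)/i);
const charset = charsetMatch ? charsetMatch[1] : 'utf-8';
const responseDecoded = new TextDecoder(charset).decode(responseBuffer);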

CertainPerformance
  • This is a good answer; however, parsing non-English text/numbers comes out like this: '����� �����'. Is there a way to supply an encoding/language to the DOMParser? (I need it to get the correct keys for Hebrew) – Leroy qt Aug 16 '18 at 04:00
  • I can't replicate the problem. I tried the URL https://stackoverflow.com/questions/33108388/using-hebrew-characters-in-string, which contains Hebrew in the first element matching `.question code`, but the Hebrew displayed just fine. Can you post a site or some example HTML that replicates the issue? – CertainPerformance Aug 16 '18 at 04:11
  • Running `document.characterSet` in the console on one of the pages returns `"windows-1255"`. Does this help? – Leroy qt Aug 16 '18 at 04:16
  • In `fetch`, you might try setting the `headers` `Content-Type` to end in `charset=windows-1255` kind of like you can see [here](https://github.com/github/fetch/issues/145), but that's just a wild guess. You also might examine the `innerHTML` instead of the `textContent`, it might be more informative, but that's also just a guess. Can you post the site in question, or an example HTML page that reproduces the problem, so I can try to debug it? – CertainPerformance Aug 16 '18 at 04:25
  • Here is a page I put on GitHub: https://github.com/Leroyxx/examplee (Edit: By the way, you'll notice I need to use `.value` instead of `.innerHTML`; that doesn't solve the problem, but it does get the correct values) – Leroy qt Aug 16 '18 at 04:34
  • Thanks, now I can replicate the problem. I think I figured it out: call `.arrayBuffer()` instead of `.text()` and then decode it with `TextDecoder`. See edit. – CertainPerformance Aug 16 '18 at 05:31