TL;DR: How can I fetch the content of the dynamically loaded news article below?
Hey everyone. I'm currently building a Chrome extension that needs to parse all the text on a page.
At first this seems straightforward: on page load you can simply walk the DOM and collect all the text nodes:
const walker = document.createTreeWalker(elem, NodeFilter.SHOW_TEXT);
for (let n; (n = walker.nextNode()); ) {
  // ... parse n ...
}
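Wrapped up as a self-contained helper, that looks like the sketch below (`collectText` is just a name I picked, not an existing API):

```javascript
// Walk all text nodes under a root element and return their
// trimmed, non-empty values. "collectText" is a made-up name.
function collectText(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const texts = [];
  for (let n; (n = walker.nextNode()); ) {
    const value = n.nodeValue.trim();
    if (value) texts.push(value);
  }
  return texts;
}
```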
However, this approach breaks down when content is loaded dynamically after page load, which is very common on news websites.
Consider the following Spanish news page: https://aristeguinoticias.com/1501/mexico/si-no-lo-iban-a-procesar-por-que-no-dejaron-a-cienfuegos-en-eu-mike-vigil/
All the main content is loaded dynamically after the page loads, so this approach does not capture the article content.
To work around this, I'm looking into the JavaScript MutationObserver API. Ideally, I would get notified every time the page changes, which would let me track all of the dynamically added content.
const observer = new MutationObserver((mutations) => {
  mutations.forEach((mutation) => {
    if (mutation.type === "characterData") {
      console.log(mutation.target.nodeValue);
    } else if (mutation.type === "childList") {
      mutation.addedNodes.forEach((node) => {
        console.log(node.nodeValue);
      });
    }
  });
});

observer.observe(document, {
  childList: true,
  subtree: true,
  characterData: true,
  attributes: false
});
This approach is better: it prints all the text in the comment section, the sidebar, the header, etc.
However, it still misses the main content (the actual news article).
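One thing I noticed while debugging: `nodeValue` is `null` for element nodes, so the `childList` branch above drops the text of any subtree that arrives as a whole element. Here's a variant I'm trying that extracts text from added elements as well (`textIn` is a hypothetical helper name, not an API):

```javascript
// Return the text carried by a node added in a childList mutation:
// text nodes directly, element subtrees via a TreeWalker.
// "textIn" is a made-up helper name for this sketch.
function textIn(node) {
  if (node.nodeType === 3) { // Node.TEXT_NODE
    return [node.nodeValue];
  }
  if (node.nodeType === 1) { // Node.ELEMENT_NODE
    const texts = [];
    const walker = document.createTreeWalker(node, NodeFilter.SHOW_TEXT);
    for (let t; (t = walker.nextNode()); ) {
      texts.push(t.nodeValue);
    }
    return texts;
  }
  return []; // comments and other node types carry no article text
}
```

In the `childList` branch I'd then call `mutation.addedNodes.forEach(node => textIn(node).forEach(t => console.log(t)))` instead of logging `nodeValue` directly.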
Any ideas on how to get the actual article content? The example is specific to the Spanish news page above, but I have run into the same issue on many websites.