TL;DR: How can I fetch the content of the dynamically loaded news article below?
Hey everyone. I'm currently building a Chrome extension that needs to parse all the text on a page.
At first this seems straightforward: on page load you can simply walk the DOM and collect all the text nodes:
const walker = document.createTreeWalker(elem, NodeFilter.SHOW_TEXT);
for (let n; (n = walker.nextNode()); ) {
  // ... parse n ...
}
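Wrapped up as a self-contained helper, that looks like the sketch below (`collectText` is just a name I picked, not an existing API):

```javascript
// Walk all text nodes under a root element and return their
// trimmed, non-empty values. "collectText" is a made-up name.
function collectText(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const texts = [];
  for (let n; (n = walker.nextNode()); ) {
    const value = n.nodeValue.trim();
    if (value) texts.push(value);
  }
  return texts;
}
```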
However, this approach breaks down when content is loaded dynamically after page load, which is very common on news websites.
Consider the following Spanish news page: https://aristeguinoticias.com/1501/mexico/si-no-lo-iban-a-procesar-por-que-no-dejaron-a-cienfuegos-en-eu-mike-vigil/
All the main content is loaded dynamically after the page loads, so this approach does not capture the article content.
To work around this, I'm looking into the JavaScript MutationObserver API. Ideally, I would get notified every time the page changes, which would let me track all of the dynamically added content.
const observer = new MutationObserver((mutations) => {
  mutations.forEach((mutation) => {
    if (mutation.type === "characterData") {
      console.log(mutation.target.nodeValue);
    } else if (mutation.type === "childList") {
      mutation.addedNodes.forEach((node) => {
        console.log(node.nodeValue);
      });
    }
  });
});

observer.observe(document, {
  childList: true,
  subtree: true,
  characterData: true,
  attributes: false
});
This approach is better: it prints all the text in the comment section, the sidebar, the header, etc.
However, it still misses the main content (the actual news article).
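One thing I noticed while debugging: `nodeValue` is `null` for element nodes, so the `childList` branch above drops the text of any subtree that arrives as a whole element. Here's a variant I'm trying that extracts text from added elements as well (`textIn` is a hypothetical helper name, not an API):

```javascript
// Return the text carried by a node added in a childList mutation:
// text nodes directly, element subtrees via a TreeWalker.
// "textIn" is a made-up helper name for this sketch.
function textIn(node) {
  if (node.nodeType === 3) { // Node.TEXT_NODE
    return [node.nodeValue];
  }
  if (node.nodeType === 1) { // Node.ELEMENT_NODE
    const texts = [];
    const walker = document.createTreeWalker(node, NodeFilter.SHOW_TEXT);
    for (let t; (t = walker.nextNode()); ) {
      texts.push(t.nodeValue);
    }
    return texts;
  }
  return []; // comments and other node types carry no article text
}
```

In the `childList` branch I'd then call `mutation.addedNodes.forEach(node => textIn(node).forEach(t => console.log(t)))` instead of logging `nodeValue` directly.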
Any ideas on how to get the actual article content? The example is specific to the Spanish news page above, but I have run into the same issue on many websites.