The structure of a webpage I crawl keeps changing (the page is dynamic), and as a result my crawler stops working. Is there a mechanism to detect structural changes to a webpage before running the full crawler, so that I can tell whether the structure has changed?
-
"My crawler stops working". Please provide a [mcve]. – mzjn Sep 28 '20 at 15:22
2 Answers
If you can run your own JavaScript code in the webpage, you can use MutationObserver, which lets you watch for changes being made to the DOM tree.
Something like:
    // Resolves once no child nodes have been added or removed for `timeout` ms.
    function waitForDomStability(timeout) {
      return new Promise((resolve) => {
        // stop observing and resolve the promise
        const waitResolve = (observer) => {
          observer.disconnect();
          resolve();
        };
        let timeoutId;
        const observer = new MutationObserver((mutationList, obs) => {
          for (let i = 0; i < mutationList.length; i += 1) {
            // we only care if nodes have been added or removed
            if (mutationList[i].type === 'childList') {
              // restart the countdown timer
              window.clearTimeout(timeoutId);
              timeoutId = window.setTimeout(waitResolve, timeout, obs);
              break;
            }
          }
        });
        // initial countdown, in case the DOM never changes
        timeoutId = window.setTimeout(waitResolve, timeout, observer);
        // start observing document.body and all of its descendants;
        // attribute changes are not observed since only childList records are used
        observer.observe(document.body, { childList: true, subtree: true });
      });
    }
I'm using this approach in the open-source scraping extension get-set-fetch. For the full code, see /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.

a1sabau
-
I was thinking of something like calculating the ASCII values of the web page and comparing them when I crawl the same page again. Do you think it's feasible? – Pradyumna Sep 28 '20 at 14:42
-
It's clearer now. I thought you wanted the page to become "stable" before scraping it. – a1sabau Sep 29 '20 at 15:07
You can certainly use "snapshots" to compare two versions of the same page. I've implemented something similar to Java's String.hashCode to achieve this.
Code in JavaScript:
    /*
     * Returns a DOM element snapshot as a hash code of its innerText.
     * Starting point is Java's String.hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * Keep everything fast: only work with a 32-bit hash and avoid exponentiation
     * by accumulating iteratively: hash = hash*31 + s[i] (same sum as above)
     */
    function getSnapshot() {
      const snapshotSelector = 'body';
      const nodeToBeHashed = document.querySelector(snapshotSelector);
      if (!nodeToBeHashed) return 0;

      const { innerText } = nodeToBeHashed;
      let hash = 0;
      if (innerText.length === 0) {
        return hash;
      }

      for (let i = 0; i < innerText.length; i += 1) {
        // an integer between 0 and 65535 representing the UTF-16 code unit
        const charCode = innerText.charCodeAt(i);
        // multiply by 31 (hash * 32 - hash) and add the current charCode
        hash = ((hash << 5) - hash) + charCode;
        // truncate to a signed 32-bit integer, as bitwise operators treat their operands as 32-bit values
        hash |= 0;
      }
      return hash;
    }
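To detect a change between two crawls, the snapshot from the previous run has to be kept somewhere and compared with the new one. A minimal sketch of that comparison follows, assuming an in-memory Map keyed by URL. The names hashString, snapshotStore and checkSnapshot are hypothetical and not part of the original answer. hashString repeats the loop from getSnapshot on a plain string so that it can be tried outside a browser.

```javascript
// Same 32-bit rolling hash as in getSnapshot, applied to an arbitrary string.
function hashString(text) {
  let hash = 0;
  for (let i = 0; i < text.length; i += 1) {
    // hash * 31 + UTF-16 code unit, truncated to a signed 32-bit integer
    hash = (((hash << 5) - hash) + text.charCodeAt(i)) | 0;
  }
  return hash;
}

// Hypothetical store: URL -> snapshot recorded during the previous crawl.
// A real crawler would likely persist this to disk or a database instead.
const snapshotStore = new Map();

// Records the new snapshot and reports how it relates to the previous one.
function checkSnapshot(url, snapshot, store = snapshotStore) {
  const previous = store.get(url);
  store.set(url, snapshot);
  if (previous === undefined) return 'new';
  return previous === snapshot ? 'unchanged' : 'changed';
}
```

For example, hashString('ab') gives 97 * 31 + 98 = 3105, the same value as "ab".hashCode() in Java. Because the hash has only 32 bits, two different pages can occasionally produce the same snapshot, so an 'unchanged' result is a strong hint rather than a guarantee.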
If you can't run JavaScript code in the page, you can hash the entire HTML response instead, in your favorite language.
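As a rough illustration of hashing the HTML response outside the browser, here is a sketch for Node.js using the built-in crypto module. The helper names fingerprint, tagSkeleton and hasChanged are assumptions made for this example. tagSkeleton is an optional, regex-based approximation that keeps only the sequence of opening tag names, so that pure text changes do not count as structural changes. It is not a real HTML parser and will misread unusual markup (for example, tags inside comments or scripts).

```javascript
const { createHash } = require('crypto');

// SHA-256 hex digest of a string.
function fingerprint(text) {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Crude approximation of the page structure: the ordered list of opening
// tag names. Closing tags, attributes and text content are ignored.
function tagSkeleton(html) {
  const tags = [];
  const openingTag = /<([a-zA-Z][a-zA-Z0-9-]*)[^>]*>/g;
  let match;
  while ((match = openingTag.exec(html)) !== null) {
    tags.push(match[1].toLowerCase());
  }
  return tags.join('>');
}

// Compares the current response against the fingerprint of a previous crawl.
// With structureOnly set, only the tag skeleton is hashed.
function hasChanged(previousFingerprint, html, { structureOnly = false } = {}) {
  const current = fingerprint(structureOnly ? tagSkeleton(html) : html);
  return { changed: current !== previousFingerprint, current };
}
```

Hashing the full response flags any change at all, including rotating ads or timestamps, so the structure-only variant may be closer to what a crawler needs when it only cares whether its selectors will still match.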

a1sabau
-
In Python it's easier, as there is built-in hashing support; see [this question](https://stackoverflow.com/questions/16008670/how-to-hash-a-string-into-8-digits) for some examples. – a1sabau Oct 02 '20 at 14:25