
While crawling a webpage, its structure keeps changing (it is dynamic), which leads to scenarios where my crawler stops working. Is there a mechanism to check a webpage for structural changes before running the full crawler, so I can tell whether the structure has changed or not?

mzjn
Pradyumna

2 Answers


If you can run your own JavaScript code in the webpage, you can use MutationObserver, which provides the ability to watch for changes being made to the DOM tree.

Something like:

waitForDomStability(timeout: number) {
  return new Promise(resolve => {
    const waitResolve = observer => {
      observer.disconnect();
      resolve();
    };

    let timeoutId;
    const observer = new MutationObserver((mutationList, observer) => {
      for (let i = 0; i < mutationList.length; i += 1) {
        // we only care if new nodes have been added
        if (mutationList[i].type === 'childList') {
          // restart the countdown timer
          window.clearTimeout(timeoutId);
          timeoutId = window.setTimeout(waitResolve, timeout, observer);
          break;
        }
      }
    });

    timeoutId = window.setTimeout(waitResolve, timeout, observer);

    // start observing document.body
    observer.observe(document.body, { attributes: true, childList: true, subtree: true });
  });
}
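
For illustration, a minimal usage sketch; the surrounding scrape method, the 2000 ms timeout and the body read are assumptions, not part of the original plugin:

async scrape() {
  // wait until no new nodes have been added for 2 seconds,
  // then assume the dynamic content has finished rendering
  await this.waitForDomStability(2000);

  // at this point the DOM can be read safely
  return document.body.innerText;
}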

I'm using this approach in the open source scraping extension get-set-fetch. For the full code, look at /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.

a1sabau
  • I was thinking of something like calculating the ASCII values of the web page and comparing them when I crawl the same page again. Do you think that's feasible? – Pradyumna Sep 28 '20 at 14:42
  • It's clearer now. I thought you wanted the page to become "stable" before scraping it. – a1sabau Sep 29 '20 at 15:07

You can certainly use "snapshots" for comparing two versions of the same page. I've implemented something similar to Java's String hashCode to achieve this.

Code in JavaScript:

/*
returns a dom element snapshot as an innerText hash code
starting point is java String hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
keep everything fast: only work with a 32 bit hash and avoid explicit exponentiation
by computing the same value iteratively (Horner's method): hash = hash*31 + s[i]
*/
function getSnapshot() {
  const snapshotSelector = 'body';
  const nodeToBeHashed = document.querySelector(snapshotSelector);
  if (!nodeToBeHashed) return 0;

  const { innerText } = nodeToBeHashed;

  let hash = 0;
  if (innerText.length === 0) {
    return hash;
  }

  for (let i = 0; i < innerText.length; i += 1) {
    // an integer between 0 and 65535 representing the UTF-16 code unit
    const charCode = innerText.charCodeAt(i);

    // multiply by 31 and add current charCode
    hash = ((hash << 5) - hash) + charCode;

    // convert to 32 bits as bitwise operators treat their operands as a sequence of 32 bits
    hash |= 0;
  }

  return hash;
}
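
A crawler could then store the hash from one run and compare it against the current one before doing a full crawl. A minimal sketch; the localStorage key used here is just a hypothetical storage choice:

// compare the current snapshot against the one saved during the previous crawl
const previousHash = Number(localStorage.getItem('pageSnapshot'));
const currentHash = getSnapshot();

if (previousHash !== currentHash) {
  // the page content changed; the crawler's selectors may need to be re-checked
  console.log('page snapshot changed since the last crawl');
}

// save the current snapshot for the next comparison
localStorage.setItem('pageSnapshot', String(currentHash));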

If you can't run JavaScript code in the page, you can use the entire HTML response as the content to be hashed in your favorite language.
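
For example, outside the browser something along these lines would work in Node.js; the URL and the choice of SHA-256 are assumptions for illustration:

const crypto = require('crypto');

// fetch the raw HTML and reduce it to a hash that can be stored and compared between crawls
async function getHtmlHash(url) {
  const response = await fetch(url); // global fetch is available in Node.js 18+
  const html = await response.text();
  return crypto.createHash('sha256').update(html).digest('hex');
}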

a1sabau
  • In Python it's easier, as there is built-in hash support; see [this question](https://stackoverflow.com/questions/16008670/how-to-hash-a-string-into-8-digits) for some examples. – a1sabau Oct 02 '20 at 14:25