
While crawling a webpage, its structure keeps changing (it is dynamic), which leads to scenarios where my crawler stops working. Is there a mechanism to check a webpage for structural changes before running the full crawler, so I can tell whether the structure has changed or not?

mzjn
Pradyumna

2 Answers


If you can run your own JavaScript code in the webpage, you can use MutationObserver, which provides the ability to watch for changes being made to the DOM tree.

Something like:

waitForDomStability(timeout: number) {
  return new Promise(resolve => {
    const waitResolve = observer => {
      observer.disconnect();
      resolve();
    };

    let timeoutId;
    const observer = new MutationObserver((mutationList, observer) => {
      for (let i = 0; i < mutationList.length; i += 1) {
        // we only care if new nodes have been added
        if (mutationList[i].type === 'childList') {
          // restart the countdown timer
          window.clearTimeout(timeoutId);
          timeoutId = window.setTimeout(waitResolve, timeout, observer);
          break;
        }
      }
    });

    timeoutId = window.setTimeout(waitResolve, timeout, observer);

    // start observing document.body
    observer.observe(document.body, { attributes: true, childList: true, subtree: true });
  });
}
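
For illustration, a minimal usage sketch; the surrounding scrape method, the 2000 ms timeout and the body read are assumptions, not part of the original plugin:

async scrape() {
  // wait until no new nodes have been added for 2 seconds,
  // then assume the dynamic content has finished rendering
  await this.waitForDomStability(2000);

  // at this point the DOM can be read safely
  return document.body.innerText;
}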

I'm using this approach in the open source scraping extension get-set-fetch. For the full code, look at /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.

a1sabau
  • I was thinking of something like calculating the ASCII values of the web page and comparing them when I crawl the same page again. Do you think that's feasible? – Pradyumna Sep 28 '20 at 14:42
  • It's clearer now. I thought you wanted the page to become "stable" before scraping it. – a1sabau Sep 29 '20 at 15:07

You can certainly use "snapshots" for comparing two versions of the same page. I've implemented something similar to Java's String hashCode to achieve this.

Code in JavaScript:

/*
returns a dom element snapshot as an innerText hash code
starting point is java String hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
keep everything fast: only work with a 32 bit hash and avoid explicit exponentiation
by computing the same value iteratively (Horner's method): hash = hash*31 + s[i]
*/
function getSnapshot() {
  const snapshotSelector = 'body';
  const nodeToBeHashed = document.querySelector(snapshotSelector);
  if (!nodeToBeHashed) return 0;

  const { innerText } = nodeToBeHashed;

  let hash = 0;
  if (innerText.length === 0) {
    return hash;
  }

  for (let i = 0; i < innerText.length; i += 1) {
    // an integer between 0 and 65535 representing the UTF-16 code unit
    const charCode = innerText.charCodeAt(i);

    // multiply by 31 and add current charCode
    hash = ((hash << 5) - hash) + charCode;

    // convert to 32 bits as bitwise operators treat their operands as a sequence of 32 bits
    hash |= 0;
  }

  return hash;
}
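
A crawler could then store the hash from one run and compare it against the current one before doing a full crawl. A minimal sketch; the localStorage key used here is just a hypothetical storage choice:

// compare the current snapshot against the one saved during the previous crawl
const previousHash = Number(localStorage.getItem('pageSnapshot'));
const currentHash = getSnapshot();

if (previousHash !== currentHash) {
  // the page content changed; the crawler's selectors may need to be re-checked
  console.log('page snapshot changed since the last crawl');
}

// save the current snapshot for the next comparison
localStorage.setItem('pageSnapshot', String(currentHash));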

If you can't run JavaScript code in the page, you can use the entire HTML response as the content to be hashed in your favorite language.
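
For example, outside the browser something along these lines would work in Node.js; the URL and the choice of SHA-256 are assumptions for illustration:

const crypto = require('crypto');

// fetch the raw HTML and reduce it to a hash that can be stored and compared between crawls
async function getHtmlHash(url) {
  const response = await fetch(url); // global fetch is available in Node.js 18+
  const html = await response.text();
  return crypto.createHash('sha256').update(html).digest('hex');
}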

a1sabau
  • In Python it's easier, as there is built-in hash support; see [this question](https://stackoverflow.com/questions/16008670/how-to-hash-a-string-into-8-digits) for some examples. – a1sabau Oct 02 '20 at 14:25