1

I want to pick all the JSON blocks out of a page. The code is:

for (const el of document.querySelectorAll(['div[class*="json"]', 'script[type="text/javascript"]', 'script[type="application/json"]'])) {
    if (el.innerHTML) {
        let matches = el.innerHTML.match(/({(?:\s|\n)*["'](?:\s|\n|.)*?["']:(?:\s|\n)*["'[{](?:\s|\n|.)+?})(?:;|$)/g)
        console.log(matches)
    }
}

Usually it matches 2-5 elements on the page. The problem is that after running this query the page just stops responding (even the Chrome DevTools search doesn't work).

I assume the query causes some load (although the CPU usage doesn't show it), so the question is: what could the problem be, and how can I optimize the expression to lower this load?

P.S. It's OK if the operation takes time, so maybe there's a way to await each operation before starting the next one, to distribute the load?

sortas
  • You could use JSON.parse in a try/catch block. If JSON.parse throws an error, it's not JSON. – gui3 Aug 12 '20 at 20:23
  • Why are you looking for JSON in JavaScript? JSON is a subset of JavaScript object/array syntax, but they're not identical. – Barmar Aug 12 '20 at 20:33
  • Even if you find it, you won't be able to parse it if it uses JS features that aren't allowed in JSON, such as single quotes around strings, or object keys that aren't quoted. – Barmar Aug 12 '20 at 20:34
  • @Barmar because JSON is not always stored in `type="application/json"` scripts, quite often it's stored as a dictionary, saved to a variable, or just inside a `div`. The question is, technically, not about `JSON`, it's about page load and how to optimize it :) – sortas Aug 12 '20 at 20:38
  • @Barmar Opened the first site from the Google Ads, valid JSON is stored into a variable in the middle of the script: https://i.imgur.com/zs1V1Xz.png. – sortas Aug 12 '20 at 20:41
  • The problem I'm trying to point out is that when it's stored in variables in code, it usually won't be restricted to the JSON subset. For instance, it might be: `var foo = {a: 1};` that's not valid JSON. – Barmar Aug 12 '20 at 20:41
  • @Barmar sometimes it's stored without `"`, etc., I agree, so I'll miss it, for sure. But quite often it's not, and RegExp could get it, so yeah. – sortas Aug 12 '20 at 20:42
  • I tried your regexp at regex101.com. It doesn't match `["foo"]`, which is valid JSON. It looks like it only matches JSON objects, not arrays. OTOH, it matches invalid JSON like `{"x": {"a": ["foo"], "b":}` – Barmar Aug 12 '20 at 20:48
  • @Barmar also agree, will update the expression, thanks :) – sortas Aug 12 '20 at 20:50

4 Answers

1

A backtracking regex engine can need exponential time in the worst case, which means a complex pattern can take a very long time. You have two choices: either run the regex off the main thread so the page remains responsive, or find a more efficient way to achieve what you are trying to do, or at least a better (less CPU-intensive) regex.
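On the "better regex" option, one likely culprit in the question's pattern is the group `(?:\s|\n|.)`: a space matches both `\s` and `.`, so when a match attempt fails the engine may retry every whitespace character both ways. A minimal sketch of a cheaper equivalent, assuming the intent is simply "any character, lazily", replaces that group with `[\s\S]`. The names `candidatePattern` and `extractJsonCandidates` are made up for illustration, and the pattern keeps the original's other limitations (objects only, loose matching), so each candidate is checked with `JSON.parse`.

```javascript
// Assumed rewrite of the question's pattern: every `(?:\s|\n)` becomes `\s`
// and every `(?:\s|\n|.)` becomes `[\s\S]`, which accept the same characters
// without giving the engine two ways to match a whitespace character.
const candidatePattern = /(\{\s*["'][\s\S]*?["']:\s*["'[{][\s\S]+?\})(?:;|$)/g;

// Hypothetical helper: returns only the candidates that really parse as JSON.
function extractJsonCandidates(source) {
  const found = [];
  for (const match of source.matchAll(candidatePattern)) {
    try {
      found.push(JSON.parse(match[1])); // group 1 excludes the trailing ";"
    } catch (err) {
      // Not valid JSON (for example single-quoted strings); skip it.
    }
  }
  return found;
}

const sample = 'var state = {"a": {"b": [1, 2]}}; init();';
console.log(extractJsonCandidates(sample));
```

This removes the ambiguous alternation but not the lazy scanning itself, so very large scripts may still be slow.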

For the first choice you can use web workers, which is a clean solution, or you can use a somewhat hacky workaround with setTimeout() or a promise. I strongly suggest a web worker if its browser support is OK for your use case (who cares about IE anyway?).
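A minimal sketch of the web worker route, assuming a browser where `Worker`, `Blob` and `URL.createObjectURL` are available. `runMatch`, `workerSource` and `matchInWorker` are hypothetical names, and the timeout is an added safety net rather than something the approach requires.

```javascript
// Hypothetical worker body: runs one regex over one string.
function runMatch({ pattern, flags, text }) {
  return text.match(new RegExp(pattern, flags));
}

// The worker script is built from the function above, so no separate file is needed.
const workerSource = `${runMatch.toString()}
self.onmessage = (event) => self.postMessage(runMatch(event.data));`;

// Browser only: resolves with the matches, or rejects after timeoutMs.
function matchInWorker(regex, text, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const url = URL.createObjectURL(new Blob([workerSource], { type: 'text/javascript' }));
    const worker = new Worker(url);
    const finish = (settle, value) => {
      clearTimeout(timer);
      worker.terminate(); // also stops a regex that is still running
      URL.revokeObjectURL(url);
      settle(value);
    };
    const timer = setTimeout(() => finish(reject, new Error('regex timed out')), timeoutMs);
    worker.onmessage = (event) => finish(resolve, event.data);
    worker.onerror = (event) => finish(reject, new Error(event.message));
    // A RegExp cannot be posted to a worker, so send its source and flags.
    worker.postMessage({ pattern: regex.source, flags: regex.flags, text });
  });
}
```

In the page this could be called as `matchInWorker(pattern, el.innerHTML).then(console.log)`. Pages with a strict Content Security Policy may refuse `blob:` workers, in which case the worker script would have to live in its own file.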

Here is an example of using a Promise with the intent of keeping the CPU-intensive task off the main thread:

    const inefficientPattern = /({(?:\s|\n)*(?:"|')(?:\s|\n|.)*?(?:"|'):(?:\s|\n)*(?:"|'|\[|{)(?:\s|\n|.)+?})(?:;|$)/g;
    for (let el of document.querySelectorAll(['div[class*="json"]', 'script[type="text/javascript"]', 'script[type="application/json"]'])) {
      if (el.innerHTML) {
        new Promise( function (resolve, reject) {
          resolve(el.innerHTML.match(inefficientPattern))
        }).then( matches => {
          console.log(matches)
        })
      }
    }

Interesting stuff: the promise executor callback is executed immediately, so I was wrong and the regex above still runs on the main thread. Check out this answer: Are JavaScript Promise asynchronous?
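A small illustration of that ordering, with nothing page-specific in it: the function passed to `new Promise` runs before the constructor returns, and only the `.then` callback is deferred.

```javascript
const order = [];

new Promise((resolve) => {
  order.push('executor'); // runs synchronously, inside the constructor call
  resolve();
}).then(() => {
  order.push('then callback'); // runs later, as a microtask
});

order.push('after constructor');
console.log(order.join(' -> ')); // prints "executor -> after constructor"
```

So any heavy work placed in the executor blocks the page exactly as it would without the Promise.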

Mechanic
1

As other answers have pointed out, this is a complex regex that could be executed against large portions of the web page's source code. A possible workaround is to use the browser's asynchronous facilities, such as Promises or Web Workers, to unfreeze the UI, but I don't think you're interested in solving this problem specifically. It seems like you're trying to scrape web data, so it wouldn't make a difference whether the UI is frozen during this process or not.

My suggestion is to divide and conquer this problem. Let's take each selector and address them individually.

script[type="application/json"]

This one seems to be pretty straightforward. You probably just need to grab its inner content and voilà, you have JSON.

div[class*="json"]

I believe this one is a non-standard way to specify the initial state for web pages. It would probably fall into the same parser as above. You probably just need to grab its inner text and try to parse it as JSON.

script[type="text/javascript"]

This is the trickiest part, since we're not dealing with JSON anymore but with executable JavaScript, which may or may not contain JSON data. For this one, you could use a simplified regex, but I'd go further and suggest something else.

You could inspect JavaScript objects and try to convert them to JSON. This could be easily done with built-in APIs or with JavaScript parsers (like js2py if you're using something like Scrapy, for example). I'm not sure about the performance of this task, but I believe it would be quicker than a complex regex and it might be worth a try.

It would work for cases like var initialState = { ... }; but maybe could present some challenges when trying to deal with inline values like hypedFramework.init({ ... }). In the latter case, you would probably need some JavaScript parsing to isolate those values. But it's still possible. Take a quick look at https://esprima.org/demo/parse.html and see how it's able to extract Object Expressions from Function Arguments.
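Short of a full JavaScript parser, a lighter alternative (a swapped-in technique, not what Esprima or js2py do) is to scan the script text for balanced brackets and hand each candidate to `JSON.parse`. The sketch below assumes double-quoted strings only and ignores comments, single-quoted strings and template literals; `findJsonBlocks` and `findMatchingClose` are hypothetical names.

```javascript
// Hypothetical helper: collect every {...} or [...] span that parses as JSON.
function findJsonBlocks(source) {
  const blocks = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (ch !== '{' && ch !== '[') { i++; continue; }
    const end = findMatchingClose(source, i);
    if (end === -1) { i++; continue; }
    try {
      blocks.push(JSON.parse(source.slice(i, end + 1)));
      i = end + 1; // skip past the block that was just consumed
    } catch (err) {
      i++; // not JSON; keep scanning, which may still find nested JSON
    }
  }
  return blocks;
}

// Returns the index of the bracket that closes the one at `start`, or -1.
// Brackets inside double-quoted strings are ignored.
function findMatchingClose(source, start) {
  let depth = 0;
  let inString = false;
  for (let j = start; j < source.length; j++) {
    const ch = source[j];
    if (inString) {
      if (ch === '\\') j++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === '{' || ch === '[') depth++;
    else if (ch === '}' || ch === ']') {
      depth--;
      if (depth === 0) return j;
    }
  }
  return -1;
}

const script = 'var s = {"a": [1, {"b": "}"}]}; f({x: 1}); var t = [1, 2];';
console.log(findJsonBlocks(script));
```

There is no regex backtracking here, although the scan can still be quadratic on deeply nested code. It also produces false positives: array indexing such as `items[0]` parses as the JSON array `[0]`, so the results probably need filtering.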

0

Try using JSON.parse in a try/catch block, like this:

for (const el of document.querySelectorAll(['div[class*="json"]', 'script[type="text/javascript"]', 'script[type="application/json"]'])) {
    if (el.innerHTML) {
        try {
           // JSON.parse throws a SyntaxError unless the whole string is valid JSON
           let matches = JSON.parse(el.innerHTML)
           console.log(JSON.stringify(matches))
        } catch (err) {
           console.log('element was not a json')
        }
    }
}

If the contents of the element are valid JSON, this runs without throwing an error; otherwise you can do something special in the catch block.

Clarification

This will not find bits of JSON inside the element; the whole element content needs to be valid JSON.

gui3
  • I mean, JSON could be in the middle of the script, like, assigned to some variable, so that won't work. Good solution, but it'll miss some JSON data, so it's not applicable. – sortas Aug 12 '20 at 20:30
  • @sortas That sounds like some major logic issues then. Single Responsibility comes to mind. But there isn't enough information on why you need to do this in the 1st place. – Austin T French Aug 12 '20 at 20:32
  • @AustinTFrench Not sure I get the point, sorry :) I want to get JSON data from the page. Sometimes it's just ` – sortas Aug 12 '20 at 20:36
  • JSON: JavaScript Object Notation. Are you saying you are parsing Script tags for a JSON too??? Or there are scripts that run (in a script loaded via script tags) which return a JS object? Regardless, it feels like a disaster. And then you add a new problem, regex. – Austin T French Aug 12 '20 at 20:57
0

I think you can split the for loop into multiple iterations by using setTimeout. This way the browser could have time to do the rendering stuff between each call to your heavy regex parsing.

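const results = [];
// checkOne(el) is assumed to run the regex on one element and return its matches.
// elements must be a non-empty array, e.g. Array.from(document.querySelectorAll(...)).
const checkAll = (elements) => {
  results.push(checkOne(elements[0]));
  if (elements.length > 1) {
    // setTimeout takes the callback first, then the delay in milliseconds
    setTimeout(() => checkAll(elements.slice(1)), 0);
  } else {
    // do something with results ...
  }
};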
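The same idea can be written with `async`/`await`, which is closer to the "await each operation" wording in the question. This is only a sketch: `yieldToEventLoop` and `checkAllSequentially` are made-up names, and `checkOne` stands for whatever function runs the regex on a single item.

```javascript
// Resolves on a later timer tick, giving the browser a chance to render in between.
const yieldToEventLoop = () => new Promise((resolve) => setTimeout(resolve, 0));

// Runs checkOne on each item in turn, pausing between items.
async function checkAllSequentially(items, checkOne) {
  const results = [];
  for (const item of items) {
    results.push(checkOne(item)); // still blocks while this one call runs
    await yieldToEventLoop();
  }
  return results;
}

// Usage with a stand-in for the real per-element check:
checkAllSequentially(['{"a": 1}', 'not json'], (text) => text.length)
  .then((results) => console.log(results));
```

This only spreads the load between elements. A single regex call that hangs on one large script will still freeze the page for as long as it runs.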