
I'm trying to find the src of all the scripts in the DOM. There are around 77 scripts on the page, but Puppeteer returns only 66. If I check the DOM, there are 12 scripts with the async attribute, and those are exactly the ones that are missing. How can we get them?

Analyzer.js

Next.js page that takes the URL of the site to scrape as user input.

import React, { useState } from 'react';

const Test = () => {

    const [websiteURL, setWebsiteURL] = useState('');

    async function submitURL(){
        const data = await fetch('/api/scraper', {
            method: 'POST',
            headers: {
                'Content-Type' : 'application/json'
            },
            body: JSON.stringify({
                url : websiteURL
            })
        })
        const response = await data.json();
        console.log(response)
    }


  return (
    <div>
        <input type="text" value={websiteURL} onChange={(e) => setWebsiteURL(e.target.value)} placeholder="Enter URL" />
        <button onClick={submitURL}>Test</button>
    </div>
  )
}

export default Test

Scraper.js

API route (under the api folder) that scrapes the script tags from the URL.

import puppeteer from 'puppeteer'

export default async function test (req, res){
      const url = req.body.url
      const browser = await puppeteer.launch()
      const page = await browser.newPage()
      await page.goto(url, { waitUntil: 'networkidle0' })
      // Read the src of every script element in the live DOM
      const data = await page.evaluate(
            () =>  Array.from(document.querySelectorAll('script'))
                .map(elem => elem.src)
             );

     console.log(data.length);
     await browser.close()
     res.status(200).json({ scripts: data })
}
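For what it's worth, the callback passed to page.evaluate can be pulled out as a plain function and exercised outside the browser; `collectScriptSrcs` and `fakeDocument` below are illustrative names I've made up for this sketch, not part of the original code:

```javascript
// The logic inside page.evaluate, extracted as a plain function so it can be
// checked without a browser. In Puppeteer, `document` is the real page document;
// here a minimal stand-in object is used for demonstration only.
function collectScriptSrcs(document) {
  return Array.from(document.querySelectorAll("script")).map(el => el.src);
}

// Stand-in document: two script elements, one with an empty src (inline script).
const fakeDocument = {
  querySelectorAll: () => [{ src: "https://example.com/a.js" }, { src: "" }],
};

console.log(collectScriptSrcs(fakeDocument)); // [ 'https://example.com/a.js', '' ]
```

Note that this only sees scripts that are in the DOM at the moment evaluate runs; scripts injected later by async loaders are exactly the ones that can be missed.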

Package.json

{
  "name": "scraper",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint"
  },
  "dependencies": {
    "cheerio": "^1.0.0-rc.12",
    "firebase": "^9.11.0",
    "graphql": "^16.6.0",
    "graphql-request": "^5.0.0",
    "mobile-friendly-test-npm": "^1.0.4",
    "moment": "^2.29.4",
    "next": "12.2.5",
    "puppeteer": "^18.2.1",
    "puppeteer-extra": "^3.3.4",
    "puppeteer-extra-plugin-stealth": "^2.11.1",
    "react": "18.2.0",
    "react-dom": "18.2.0",
    "react-firebase-hooks": "^5.0.3",
    "react-share": "^4.4.0"
  },
  "devDependencies": {
    "eslint": "8.23.0",
    "eslint-config-next": "12.2.5"
  }
}

This is a Next.js project that can be run with npm run dev.

  • Can you make a sample page and provide a complete, runnable [mcve] using `setContent`, or share the URL? See also [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors-when-asking-a-question) (that includes terminal output). Thanks. – ggorlen Oct 12 '22 at 15:12
  • I have made the changes, is it better? – Anshum Shailey Oct 12 '22 at 15:43
  • Thanks, but I still can't run this. I see there's some React code, but that's a few steps away from something Puppeteer can automate. I need a full workflow: package.json, a build command, HTML, a full site URL that Puppeteer is navigating to, etc. [Why do you need the src of script tags anyway](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/233676#233676)? – ggorlen Oct 12 '22 at 15:45
  • I have added in a few more details, please let me know if that's helpful – Anshum Shailey Oct 12 '22 at 15:53
  • Oh, I see--the React app is just a front-end for the scraper. I think you can omit that--all that it's doing is filling in a URL, so I doubt it's the problem. Hardcode in `req.body.url` and your Puppeteer code should be complete. I still don't know why you care about the script tags on this site, though. There could be a much easier way to do whatever it is you're trying to do. – ggorlen Oct 12 '22 at 15:58
  • Yep, it's just the scraper part I'm concerned about. I'm selecting all scripts, but the async ones are not showing – Anshum Shailey Oct 12 '22 at 16:00
  • I realize you're concerned with selecting all scripts, but why do you want to do that? What data are you ultimately trying to get from these scripts or what high-level goal is selecting scripts supposed to achieve? – ggorlen Oct 12 '22 at 16:03
  • I'm trying to make a scraper that identifies if a site has google analytics or not, by searching for googletagmanager in the script source. It is working fine for all sites where the script is not async, but failing in this case; the script tag is present if I inspect, but if I view the page source it is not there – Anshum Shailey Oct 12 '22 at 16:03
  • Also googleadservices script tags – Anshum Shailey Oct 12 '22 at 16:04
  • OK--doesn't google tag manager attach to the window as `GoogleAnalyticsObject`, `googletag` or similar? You should also be able to intercept all script requests using `page.on("response", res => {})` then scan the code there rather than dealing with the DOM, which is messier. – ggorlen Oct 12 '22 at 16:04
  • I'm not very sure of that, but I'm also looking for a few more scripts that identify other things, such as facebook.connect scripts that are usually there if they are remarketing on Facebook. It's not about this website but all websites in general; with the async tag I can't seem to access the scripts – Anshum Shailey Oct 12 '22 at 16:07
  • Did you see my response above? Just intercept the scripts rather than dealing with the DOM. All data that's transferred to the page comes through a response, so this should give you access to all of the content of every script that the page runs as soon as it arrives, without having to select anything from the DOM, which is prone to error. As I linked above, this seems like a potential [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/233676#233676). – ggorlen Oct 12 '22 at 16:08
  • Oh alright, I'm not very sure how I can access that but I'll look into it, do let me know if you know of any resource I can look into, also a very big thank you for helping me out! – Anshum Shailey Oct 12 '22 at 16:10
  • No problem. I can add an answer, but which data is missing so I know that my interception approach collected whatever you weren't seeing with the `querySelectorAll` approach? In other words, what's the expected Google Tag-related output? – ggorlen Oct 12 '22 at 16:15
  • I want to check if the src of any script tag on the page include any of these static.ads-twitter.com, googleadserivces.com, googleads.g.doubleclick.net, connect.facebook.net, snap.licdn.com, google-analytics.com, googletagmanager.com – Anshum Shailey Oct 12 '22 at 16:17

1 Answer


After discussion in the comments, it seems your goal is to analyze the contents of the scripts to check for certain substrings. Rather than doing this through the DOM, I would intercept the responses that match the "script" resource type, then test their raw contents. This should be much more reliable than dealing with selectors since all data the page uses has to come through a response before it has a chance of being attached to the DOM.

Here's an example which also tests the scripts' URLs for a substring match:

const puppeteer = require("puppeteer"); // ^18.0.5

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();

  const url = "https://www.stackoverflow.com";
  const targets = [
    "static.ads-twitter.com",
    "googleadserivces.com",
    "googleads.g.doubleclick.net",
    "connect.facebook.net",
    "snap.licdn.com",
    "google-analytics.com",
    "googletagmanager.com",
  ];
  const matches = Object.fromEntries(targets.map(e => [e, []]));
  const captureAnalytics = async res => {
    if (res.request().resourceType() === "script" || res.url() === url) {
      const text = await res.text();

      for (const e of targets) {
        if (res.url().includes(e) || text.includes(e)) {
          matches[e].push(res.url());
        }
      }
    }
  };
  page.on("response", captureAnalytics);
  await page.goto(url, {waitUntil: "networkidle0"});
  page.off("response", captureAnalytics);
  console.log(matches);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

Note that networkidle0 resolves once there has been no network activity for 500 ms, so slow or late-arriving requests can be missed, potentially leading to a false negative. You may wish to extend the network monitor's timeout or use some other predicate. A simple sleep after the goto is a blunt but useful way to see what can be collected (await new Promise(r => setTimeout(r, 10000));).
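If you go the sleep route, a tiny helper keeps the call readable. This is only a sketch; the 10-second figure is arbitrary, and the placement shown in the comments refers to the variable names from the code above:

```javascript
// Minimal sleep helper; resolves after the given number of milliseconds.
const sleep = ms => new Promise(r => setTimeout(r, ms));

// Placement sketch, inside the async IIFE from the code above:
//   await page.goto(url, {waitUntil: "networkidle0"});
//   await sleep(10_000); // give late-arriving scripts time to show up
//   page.off("response", captureAnalytics);
```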

Also, many sites will detect and block you as a bot, so there's more work to be done on the site you mention for any method at all to work. See Why does headless need to be false for Puppeteer to work? for details.
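Since the comments below ask for an independent boolean per substring, the matches object built by the code above can be reduced to one boolean per target. The sample data here is hardcoded for illustration; in the real script, matches comes from the captureAnalytics handler:

```javascript
// `matches` maps each target substring to the URLs it was found in.
// Hardcoded sample data standing in for the real captureAnalytics output.
const matches = {
  "connect.facebook.net": [],
  "googletagmanager.com": ["https://www.googletagmanager.com/gtag/js?id=G-X"],
};

// One boolean per target: true if at least one response matched that substring.
const found = Object.fromEntries(
  Object.entries(matches).map(([target, urls]) => [target, urls.length > 0])
);

console.log(found); // connect.facebook.net: false, googletagmanager.com: true
```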

ggorlen
  • Thank you very much for the answer, this seems to be working fine for the stackoverflow url but does not give a match for certain urls that have these substrings, including the one I am testing on: https://yfdecor.com – Anshum Shailey Oct 12 '22 at 17:03
  • Which URLs are you missing specifically? What's the expected output? – ggorlen Oct 12 '22 at 17:05
  • "googleadserivces.com", "connect.facebook.net", "googleadserivces.com" these are present in the script src, we can see them if we inspect the page, but in the output, I can only see the the url of the site itself, but in the case of stackoverflow we get 'https://www.googletagmanager.com/gtag/js?id=G-WCZ03SZFCQ' and the stack overflow url itself in the array – Anshum Shailey Oct 12 '22 at 17:10
  • If you run headfully, do you see the expected output? I suspect you're being detected as a bot on this particular page, which is a separate issue. Also, take note of the race condition I mentioned with `"networkidle0"`. You can try adding a predicate or sleeping (less ideal) until all of the responses you expect arrive. – ggorlen Oct 12 '22 at 17:12
  • On running headfully, I just see the browser open and close in 4-5 seconds, the output is just the url of the site. By adding a predicate do you mean adding a delay? Anything I can do about the bot issue, its my first post here – Anshum Shailey Oct 12 '22 at 17:18
  • No, predicate means some condition other than a delay. But sleeping is a good start, just to show that you can get the expected output (I can--I get 6 URLs starting with `['https://www.googletagmanager.com/gtm.js?id=GTM-TQFGMV4', 'https://connect.facebook.net/en_US/fbevents.js', 'https://www.googleadservices.com/pagead/conversion_async.js',]`). See [Why does headless need to be false for Puppeteer to work?](https://stackoverflow.com/questions/63818869/why-does-headless-need-to-be-false-for-puppeteer-to-work)? – ggorlen Oct 12 '22 at 17:18
  • Anyway, if your final goal is a boolean (does the site use any analytics or not), then the URL of the site itself should be enough to prove that it does, because that initial `.html` payload must have one of those substrings in it if it shows up in the result array. If you try https://example.com you should see an empty array. Notwithstanding the aforementioned race conditions and being blocked as a bot. – ggorlen Oct 12 '22 at 17:22
  • I do want a boolean value but an independent boolean value for each substring, I'm a bit new to puppeteer, apologies if I'm asking too many questions but could you explain a bit on how I should go about the sleep/predicate thing? – Anshum Shailey Oct 12 '22 at 17:28
  • Oh. It'd have been nice to know that up front. I'll update soon. The predicate is up to you based on your business requirements, which are unclear. An example predicate might be "check the first 50 requests, then stop", for example. There's no obviously "good" predicate I can think of other than sleeping. As you are probably learning, all sites are different, so it's difficult to pick a silver bullet that will work 100% of the time. Goals like "unambiguously determine if site is using analytics" are hard. Sleeping is `await new Promise(r => setTimeout(r, 10 * 1000))` to sleep for 10 seconds. – ggorlen Oct 12 '22 at 17:30
  • Okay I understood that, and where exactly do I add the sleep? – Anshum Shailey Oct 12 '22 at 17:39
  • After the `goto` and before `page.off`. Run headfully because your site is detecting you as a bot. See my updated code. – ggorlen Oct 12 '22 at 17:44
  • I am running headfully; this is the output I'm getting: { 'static.ads-twitter.com': [], 'googleadserivces.com': [], 'googleads.g.doubleclick.net': [], 'connect.facebook.net': [], 'snap.licdn.com': [], 'google-analytics.com': [], 'googletagmanager.com': [] } – Anshum Shailey Oct 12 '22 at 17:52
  • For stackoverflow, this is what I get, 'static.ads-twitter.com': [], 'googleadserivces.com': [], 'googleads.g.doubleclick.net': [ 'https://www.googletagmanager.com/gtag/js?id=G-WCZ03SZFCQ' ], 'connect.facebook.net': [], 'snap.licdn.com': [], 'google-analytics.com': [ 'https://www.googletagmanager.com/gtag/js?id=G-WCZ03SZFCQ' ], 'googletagmanager.com': [ 'https://www.googletagmanager.com/gtag/js?id=G-WCZ03SZFCQ' ] – Anshum Shailey Oct 12 '22 at 17:52
  • It seems like you're being blocked even when you run headfully on your target site. If you `console.log(res.url())` you probably won't see much--just a small fraction of the normal responses. You can also verify this by playing with the page and see if the functionality is limited. You can try all of the usual ways of getting around the detection in the link--proxies, Puppeteer extra stealth plugin, etc, but it's pretty much out of scope for this post. – ggorlen Oct 12 '22 at 17:53
  • I'll look into these suggestions, thank you very much for your help, really grateful! – Anshum Shailey Oct 12 '22 at 17:55