I am trying to scrape multiple URLs one by one, then repeat the scrape after one minute.

But I keep getting two errors and was hoping for some help.

I got an error saying:

functions declared within loops referencing an outer scoped variable may lead to confusing semantics

And I get this error when I run the function / code:

TimeoutError: Navigation timeout of 30000 ms exceeded.

My code:

const puppeteer = require("puppeteer");

const urls = [
  'https://www.youtube.com/watch?v=cw9FIeHbdB8',
  'https://www.youtube.com/watch?v=imy1px59abE',
  'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
];

const scrape = async() => {
  let browser, page;

  try {
    browser = await puppeteer.launch({ headless: true });
    page = await browser.newPage();

    for (let i = 0; i < urls.length; i++) {
      const url = urls[i];
      await page.goto(`${url}`);
      await page.waitForNavigation({ waitUntil: 'networkidle2' });

      await page.waitForSelector('.view-count', { visible: true, timeout: 60000 });

      const data = await page.evaluate(() => { // lint warning "functions declared within loops referencing an outer scoped variable" is flagged on this line
        return [
          JSON.stringify(document.querySelector('#text > a').innerText),
          JSON.stringify(document.querySelector('#container > h1').innerText),
          JSON.stringify(document.querySelector('.view-count').innerText),
          JSON.stringify(document.querySelector('#owner-sub-count').innerText)
        ];
      });

      const [channel, title, views, subs] = [JSON.parse(data[0]), JSON.parse(data[1]), JSON.parse(data[2]), JSON.parse(data[3])];
      console.log({ channel, title, views, subs });
    }
  } catch(err) {
    console.log(err);
  } finally {
    if (browser) {
      await browser.close();
    }
    await setTimeout(scrape, 60000); // repeat one minute after all URLs have been scraped
  }
};

scrape();

I would really appreciate any help I could get.

2 Answers

This works. Wrapping the for loop in a Promise and passing waitUntil: "networkidle2" as an option to page.goto() resolves your problem. You don't need to launch a new browser for each URL, so it is declared outside of the for loop.

const puppeteer = require("puppeteer");

const urls = [
  "https://www.youtube.com/watch?v=cw9FIeHbdB8",
  "https://www.youtube.com/watch?v=imy1px59abE",
  "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const scrape = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  new Promise(async (resolve, reject) => {
    for (const url of urls) {
      // your timeout
      await page.waitForTimeout(6 * 1000);

      await page.goto(`${url}`, {
        waitUntil: "networkidle2",
        timeout: 60 * 1000,
      });

      await page.waitForSelector(".view-count", {
        timeout: 60 * 1000,
      });

      const data = await page.evaluate(() => {
        return [
          JSON.stringify(document.querySelector("#text > a").innerText),
          JSON.stringify(document.querySelector("#container > h1").innerText),
          JSON.stringify(document.querySelector(".view-count").innerText),
          JSON.stringify(document.querySelector("#owner-sub-count").innerText),
        ];
      });

      const [channel, title, views, subs] = [
        JSON.parse(data[0]),
        JSON.parse(data[1]),
        JSON.parse(data[2]),
        JSON.parse(data[3]),
      ];
      console.log({ channel, title, views, subs });
    }
    resolve(true);
  })
    .then(async () => {
      await browser.close();
    })
    .catch((reason) => {
      console.log(reason);
    });
};

scrape();

Update: As per ggorlen's suggestion, the refactored code below should solve your problem. The comments in the code indicate the purpose of each line.

const puppeteer = require("puppeteer");

const urls = [
    "https://www.youtube.com/watch?v=cw9FIeHbdB8",
    "https://www.youtube.com/watch?v=imy1px59abE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const scrape = async () => {
    // generate a headless browser instance
    const browser = await puppeteer.launch({ headless: true });

    // used .entries to get the index and value
    for (const [index, url] of urls.entries()) {

        // generating a new page for each of the content
        const page = await browser.newPage();

        // your 60-second delay, applied from the second URL onward
        if (index > 0) await page.waitForTimeout(60 * 1000);

        // wait for the page to load, with a 60-second timeout (throws an error on timeout)
        await page.goto(`${url}`, {
            waitUntil: "networkidle2",
            timeout: 60 * 1000,
        });

        // wait for .view-count section to be available
        await page.waitForSelector(".view-count");

        // no need for JSON.stringify/parse, as Puppeteer serializes the return value for you
        await page.evaluate(() => ({
            channel: document.querySelector("#text > a").innerText,
            title: document.querySelector("#container > h1").innerText,
            views: document.querySelector(".view-count").innerText,
            subs: document.querySelector("#owner-sub-count").innerText
        })).then(data => {
            // your successfully scraped data
            console.log('response', data);
        }).catch(reason => {
            // your scraping error reason
            console.log('error', reason);
        }).finally(async () => {
            // close the current page
            await page.close();
        });
    }

    // after looping through finally close the browser
    await browser.close();
};

scrape();
  • I really appreciate the help, but I just tried to add a finally block after the catch block and I sometimes get an error if `data` comes back undefined, and then the finally block is not called. Would you know any way to fix that? –  Aug 19 '21 at 12:40
  • Thanks for the answer! This is the right idea but the `JSON.stringify`/`parse` calls and `new Promise` aren't needed. ``${url}`` -> `url`; `then(async () => {` should be `.finally(() => browser?.close())` so it runs even if there's a throw; this only waits 6 seconds, not 60 seconds; and a few other minor nitpicks. Welcome to SO! – ggorlen Aug 19 '21 at 14:39
  • @ggorlen I replaced the `then` with a `finally` after catch block but it ends up not calling the finally block if it throws an error. Would you know the reason why this is happening? –  Aug 19 '21 at 15:10
  • Seems like that should be OK -- I'd have to see the code. – ggorlen Aug 19 '21 at 15:28
  • @ggorlen updated code should resolve the issue. – Md Minhaz Ahamed Aug 20 '21 at 04:31

I'd suggest a design like this:

const puppeteer = require("puppeteer");

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const scrapeTextSelectors = async (browser, url, textSelectors) => {
  let page;

  try {
    page = await browser.newPage();
    page.setDefaultNavigationTimeout(50 * 1000);
    page.goto(url);
    const dataPromises = textSelectors.map(async ({name, sel}) => {
      await page.waitForSelector(sel);
      return [name, await page.$eval(sel, e => e.innerText)];
    });
    return Object.fromEntries(await Promise.all(dataPromises));
  }
  finally {
    page?.close();
  }
};

const urls = [
  "https://www.youtube.com/watch?v=cw9FIeHbdB8",
  "https://www.youtube.com/watch?v=imy1px59abE",
  "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];
const textSelectors = [
  {name: "channel", sel: "#text > a"},
  {name: "title", sel: "#container > h1"},
  {name: "views", sel: ".view-count"},
  {name: "subs", sel: "#owner-sub-count"},
];

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});

  for (;; await sleep(60 * 1000)) {
    const data = await Promise.allSettled(urls.map(url => 
      scrapeTextSelectors(browser, url, textSelectors)
    ));
    console.log(data);
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

A few remarks:

  • This runs in parallel on the 3 URLs using Promise.allSettled. If you have more URLs, you'll want a task queue, or you can run sequentially over the URLs with a for .. of loop so you don't outstrip the system's resources (see the sketch after this list). See this answer for elaboration.
  • I use waitForSelector on each and every selector rather than just '.view-count' so you won't miss anything.
  • page.setDefaultNavigationTimeout(50 * 1000); gives you an adjustable 50-second delay on all operations.
  • Moving the loops to sleep and step over the URLs into the caller gives cleaner, more flexible code. Generally, if a function can operate on a single element rather than a collection, it should.
  • Error handling is improved; Promise.allSettled lets the caller control what to do if any requests fail. You might want to filter and/or map the data response to remove the statuses: data.map(({value}) => value).
  • Generally, return instead of console.log data to keep functions flexible. The caller can console.log in the format they desire, if they desire.
  • There's no need to do anything special in page.goto(url) because we're awaiting selectors on the very next line. "networkidle2" just slows things down, waiting for network requests that might not impact the selectors we're interested in.
  • JSON.stringify/JSON.parse is already called by Puppeteer on the return value of evaluate so you can skip it in most cases.
  • Generally, don't do anything but cleanup in finally blocks. await setTimeout(scrape, 60000) is misplaced.
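
A minimal sketch of the sequential variant mentioned in the first point, reusing the sleep, urls, textSelectors and scrapeTextSelectors definitions above (the loop shape is just one possible approach; swap it in for the Promise.allSettled runner):

// sequential variant: one URL at a time, so resource usage stays flat
// no matter how many URLs you add
let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});

  for (;; await sleep(60 * 1000)) {
    for (const url of urls) {
      try {
        const data = await scrapeTextSelectors(browser, url, textSelectors);
        console.log(data);
      }
      catch (err) {
        console.error(err); // one failed URL shouldn't stop the rest
      }
    }
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;
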
  • I have a question: what's with the question mark in `page?.close` and `browser?.close`? Is that a spelling mistake or am I missing something here? –  Aug 19 '21 at 15:45
  • It's called [optional chaining](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Optional_chaining). It's shorthand for roughly `page && page.close()` or `if (page) page.close()`. It's possible that a throw happens before the `page` or `browser` objects are defined, so we don't want to attempt calling `undefined.close()` under those circumstances. – ggorlen Aug 19 '21 at 16:28
  • Oh OK, I have never seen that done before. Thanks for the information and link! –  Aug 19 '21 at 16:35
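
A tiny standalone sketch of the optional chaining behavior discussed above (hypothetical values, just for illustration):

// `?.` short-circuits to undefined instead of throwing when the
// left-hand side is null or undefined
let page; // never assigned, e.g. because puppeteer.launch() threw first

// page.close();  // TypeError: Cannot read properties of undefined
page?.close();    // no-op: the call is skipped entirely

const browser = { close: () => console.log("closed") };
browser?.close(); // browser is defined, so close() runs and logs "closed"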