I am trying to scrape the video titles and links from a YouTube channel using Puppeteer. When I run the program, I get the following evaluation error:

Error: Evaluation failed: TypeError: Cannot read properties of null (reading 'innerText')
    at pptr://__puppeteer_evaluation_script__:10:65
    at ExecutionContext._ExecutionContext_evaluate (E:\somoy\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:229:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ExecutionContext.evaluate (E:\somoy\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:107:16)
    at async initiate (E:\somoy\appNew.js:45:20)
    at async E:\somoy\appNew.js:155:9

Here is my code:

async function initiate() {
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null, userDataDir: './userdata', executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe' });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(0)
    await page.goto('https://www.youtube.com/@ProthomAlo/videos', { waitUntil: 'networkidle2' });
    await delay(5000);
    if (!fs.existsSync('storeLink.txt')) {
        //create new file if not exist
        fs.writeFileSync("storeLink.txt", '');
    }
    articleLinkarr = (fs.readFileSync('storeLink.txt', { encoding: 'utf8' })).split('\n')
    let articles = await page.evaluate(async (articleLinkarr) => {
        //console.log('Hello1')
        let arrObj = [];
        articles = document.querySelectorAll('.style-scope.ytd-rich-grid-media');

        for (let i = 0; i < articles.length; i++) {
            //for (let i = 0; i < 20; i++) {
            //const category = document.querySelector('.print-entity-section-wrapper.F93gk').innerText
            //const headline = articles[i].querySelector('div > h3').innerText
            const headline = articles[i].querySelector('h3').innerText
            const link = 'https://www.youtube.com' + articles[i].querySelector('a').getAttribute('href')
            // if (!(link.includes('video') || link.includes('fun') || link.includes('photo'))) {
            //     if (!articleLinkarr.includes(link)) {
            arrObj.push({ articleHeadline: headline, articleLink: link })
            //     }
            // }
        }
        return arrObj;
    }, articleLinkarr)
}
  • It seems that you are just listing videos from a given YouTube channel. Why not just use the [YouTube Data API v3](https://developers.google.com/youtube/v3) [this way](https://stackoverflow.com/a/27872244)? – Benjamin Loison Dec 27 '22 at 11:49
  • @BenjaminLoison Thanks for your suggestion. This is just the initial step; I have other goals later. – Lomat Haider Dec 27 '22 at 16:12

1 Answer

Puppeteer doesn't seem necessary here if you just want the initial set of titles. There's a JSON blob in the static HTML which has the title list, so you can make a simple HTTP request to the URL and pull the blob out with an HTML parser, then walk the object structure.

const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "Your URL";

fetch(url) // Node 18 or install node-fetch
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
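    // Find the inline script that assigns ytInitialData, then strip the
    // 20-character "var ytInitialData = " prefix and the trailing ";"
    // so that only the JSON object remains.
    const script = $(
      [...$("script")].find(e =>
        $(e).text().startsWith("var ytInitialData = {")
      )
    )
      .text()
      .slice(20, -1);
    const data = JSON.parse(script);
    const titles = [];
    // tabs[1] is assumed to be the Videos tab; YouTube may change this layout.
    const {contents} =
      data.contents.twoColumnBrowseResultsRenderer.tabs[1].tabRenderer
        .content.richGridRenderer;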

    for (const c of contents) {
      if (!c.richItemRenderer) {
        continue;
      }

      const title =
        c.richItemRenderer.content.videoRenderer.title.runs[0].text;
      const url =
        c.richItemRenderer.content.videoRenderer.navigationEndpoint
          .commandMetadata.webCommandMetadata.url;
      titles.push({title, url});
    }

    console.log(titles);
  })
  .catch(err => console.error(err));
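
Since the blob's layout is undocumented and can change, it may be worth walking the object structure defensively. The following is only a sketch: extractTitles is a hypothetical helper, and the property names are copied from the snippet above rather than from any official schema. It looks for whichever tab carries a richGridRenderer instead of assuming index 1, and skips entries that lack a title or URL.

```javascript
// Sketch: walk ytInitialData defensively. The property names mirror the
// snippet above and are an assumption about YouTube's current markup.
function extractTitles(data) {
  const tabs = data?.contents?.twoColumnBrowseResultsRenderer?.tabs ?? [];

  // Pick the first tab that carries a video grid instead of assuming tabs[1].
  const grid = tabs
    .map(tab => tab?.tabRenderer?.content?.richGridRenderer)
    .find(Boolean);

  const titles = [];

  for (const item of grid?.contents ?? []) {
    const video = item?.richItemRenderer?.content?.videoRenderer;
    const title = video?.title?.runs?.[0]?.text;
    const url =
      video?.navigationEndpoint?.commandMetadata?.webCommandMetadata?.url;

    // Skip non-video entries and anything missing a title or URL.
    if (title && url) {
      titles.push({title, url});
    }
  }

  return titles;
}
```

In the fetch example, this could replace the destructuring and the loop: const titles = extractTitles(JSON.parse(script));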

If you do want to use Puppeteer, you can select these titles and URLs with:

const puppeteer = require("puppeteer"); // ^19.0.0

const url = "Your URL";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("#video-title-link");
  const titles = await page.$$eval("#video-title-link", els =>
    els.map(e => ({title: e.textContent, url: e.href}))
      .filter(e => e.url)
  );
  console.log(titles);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

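For some reason, YouTube reuses the id video-title-link on every video, so the ids aren't unique. That's why $$eval returns multiple elements for an id selector here.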

This is less code, but it's much slower than the fetch approach (~10x slower on my machine). You can speed it up a bit by blocking irrelevant resources.
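
Blocking resources can be done with Puppeteer's request interception. This is a minimal sketch, not a tuned setup: blockIrrelevantResources and shouldBlock are hypothetical helper names, and the set of blocked resource types is an assumption you may need to adjust if the page stops rendering the elements you need.

```javascript
// Resource types assumed to be unnecessary for reading titles and links.
const BLOCKED_TYPES = new Set(["image", "media", "font"]);

const shouldBlock = resourceType => BLOCKED_TYPES.has(resourceType);

// `page` is a Puppeteer page. Call this before page.goto().
async function blockIrrelevantResources(page) {
  await page.setRequestInterception(true);

  page.on("request", req => {
    // Every intercepted request must be either aborted or continued.
    if (shouldBlock(req.resourceType())) {
      req.abort();
    } else {
      req.continue();
    }
  });
}
```

In the Puppeteer example above, you would add await blockIrrelevantResources(page); right before the page.goto call.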


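As an aside, always declare your variables with const (or let if they are reassigned) to avoid making them global. In your code, articleLinkarr and articles are assigned without a declaration, which makes them implicit globals.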
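
To illustrate the difference, here is a small sketch with hypothetical function names. In strict mode, assigning to an undeclared name throws a ReferenceError instead of silently creating a global, while a const declaration keeps the variable local to the function.

```javascript
function parseLinksWithoutDeclaration(text) {
  "use strict";
  // No const/let: strict mode rejects this assignment with a ReferenceError.
  // In sloppy mode it would silently create a global variable instead.
  undeclaredLinks = text.split("\n");
  return undeclaredLinks;
}

function parseLinks(text) {
  // Declared with const, so `links` is scoped to this function.
  const links = text.split("\n").filter(Boolean);
  return links;
}
```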

page.setDefaultNavigationTimeout(0) is generally not a great pattern, because a navigation could then hang forever. I'd set this to 3 or 4 minutes at most. If navigation is taking that long, something is wrong, and you should log the failure so you can take a look at it.
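
One way to apply this is sketched below. gotoWithTimeout and NAV_TIMEOUT_MS are hypothetical names, and the 3-minute value is just the upper bound suggested above, not a recommendation from the Puppeteer docs. The helper sets a finite navigation timeout, logs the failure, and rethrows so the caller can decide what to do.

```javascript
const NAV_TIMEOUT_MS = 3 * 60 * 1000; // 3 minutes

// `page` is a Puppeteer page; `log` defaults to console.error.
async function gotoWithTimeout(page, url, log = console.error) {
  page.setDefaultNavigationTimeout(NAV_TIMEOUT_MS);

  try {
    return await page.goto(url, {waitUntil: "domcontentloaded"});
  } catch (err) {
    // Record the failure so slow or broken navigations get noticed.
    log(`Navigation to ${url} failed: ${err.message}`);
    throw err;
  }
}
```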

ggorlen