1

I'm trying to create a function that captures the src attribute of images on a website, but none of the most common ways of doing so are working.

This was my original attempt.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  try {
    await page.setDefaultNavigationTimeout(0);
    await page.waitForTimeout(500);

    await page.goto(
      `https://www.sirved.com/restaurant/essex-ontario-canada/dairy-freez/1/menus/3413654`,
      {
        waitUntil: "domcontentloaded",
      }
    );

    const fetchImgSrc = await page.evaluate(() => {
      const img = document.querySelectorAll(
        "#menus > div.tab-content >div > div > div.swiper-wrapper > div.swiper-slide > img"
      );
      let src = [];
      for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
      }
      return src;
    });

    console.log(fetchImgSrc);
  } catch (err) {
    console.log(err);
  }

  await browser.close();
})();

This logged an empty array:

[]

In my next attempt I tried a suggestion, which returned an empty string.

    await page.setViewport({ width: 1024, height: 768 });
    
    const imgs = await page.$$eval("#menus img", (images) =>
      images.map((i) => i.src)
    );

    console.log(imgs);

In my final attempt I followed another suggestion, which returned an array containing two empty strings.

    const fetchImgSrc = await page.evaluate(() => {
      const img = document.querySelectorAll(".swiper-lazy-loaded");
      let src = [];
      for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
      }
      return src;
    });

    console.log(fetchImgSrc);

In each attempt I only replaced the function and the console.log portion of the code. I've done a lot of digging and found that these are the most common ways of scraping an image src with Puppeteer. I've used them elsewhere, but for some reason they aren't working for me here. I'm not sure whether I have a bug in my code or there is some other reason it does not work.

2 Answers

1

To return the src links for the two menu images on this page, you can use:

const fetchImgSrc = await page.evaluate(() => {
    const img = document.querySelectorAll('.swiper-lazy-loaded');
    let src = [];
    for (let i = 0; i < img.length; i++) {
       src.push(img[i].getAttribute("src"));
    }
    return src;
});

This gives us the expected output:

['https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg', 'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg']
Himanshu Poddar
1

You have two issues here:

  1. By default Puppeteer opens the page in a small viewport (800×600), and the images you want to scrape are lazy loaded: while they are outside the viewport they are not loaded and do not even have a src. You need to give the page a bigger viewport with page.setViewport.
  2. Element.getAttribute is not advised when you work with dynamically changing websites: it returns the value of the HTML attribute, which is an empty string for a lazy-loaded image that has not been loaded yet. What you need is the src property, which reflects the current, resolved URL in the DOM. This is the attribute vs. property distinction in JavaScript.
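As a rough illustration of the second point, the sketch below reads the src property first and only falls back to a data-src attribute when the property is empty. It assumes the slider keeps the deferred URL in data-src until the image loads (Swiper's lazy module works this way, but that has not been verified for this page). The names pickImageUrl and collectImageUrls are hypothetical helpers, not part of Puppeteer.

```javascript
// Pick the first usable URL for one image from values read out of the DOM.
// `src` is the element's src property; `dataSrc` is its data-src attribute,
// ASSUMED here to hold the deferred URL of a lazy-loaded image.
function pickImageUrl({ src, dataSrc }) {
  const candidates = [src, dataSrc];
  const found = candidates.find(
    (value) => typeof value === "string" && value.trim() !== ""
  );
  return found ? found.trim() : null;
}

// Hypothetical helper: `page` is a Puppeteer Page that has already navigated.
async function collectImageUrls(page, selector = "#menus img") {
  // Read both values inside the browser context in a single round trip.
  const raw = await page.$$eval(selector, (images) =>
    images.map((img) => ({
      src: img.src,
      dataSrc: img.getAttribute("data-src"),
    }))
  );
  // Drop images for which neither value is available.
  return raw.map(pickImageUrl).filter((url) => url !== null);
}
```

Calling `await collectImageUrls(page)` after `page.goto(...)` should then return whichever URLs are available, even if the lazy loader has not run yet, provided the data-src assumption holds.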

By the way: you can shorten your script with page.$$eval like this:

await page.setViewport({ width: 1024, height: 768 })
const imgs = await page.$$eval('#menus img', images => images.map(i => i.src))
console.log(imgs)

Output:

[
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg',
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg'
]
theDavidBarton
  • This is a unique attempt but it still did not work for me. – Everett De Leon Aug 01 '22 at 03:53
  • I see now that you are in headless mode, where the lazy loading indeed needs further triggers (e.g. scrolling down to the container). I will take a look later. Your code is not buggy; it will do what you wanted (if you add the bigger viewport). – theDavidBarton Aug 01 '22 at 08:14
  • If I ran the browser with a head, you're saying your current function works? I don't mind doing it either way. – Everett De Leon Aug 01 '22 at 15:40
  • It actually scraped both images earlier. Since today it only returns the first image's src, so I think my solution still does not give you the most stable result at this point. I will try to fine-tune it :) – theDavidBarton Aug 01 '22 at 16:42
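
The comments mention that lazy loading in headless mode may need a further trigger, such as scrolling to the container. One possible way to do that is sketched below: scroll every matching image into view, then wait until each one reports a non-empty src. The helper name waitForLazyImages, the default selector and the 10-second timeout are assumptions for illustration. Scrolling may not be enough for slides that the slider itself keeps off-screen, so treat this as a starting point rather than a confirmed fix for this page.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page that has already navigated.
async function waitForLazyImages(page, selector = "#menus img", timeout = 10000) {
  // Scroll each matching image into view to give the lazy loader a trigger.
  await page.$$eval(selector, (images) =>
    images.forEach((img) => img.scrollIntoView())
  );

  // Wait until at least one image matches and all of them have a src.
  // waitForFunction rejects with a timeout error if this never becomes true.
  await page.waitForFunction(
    (sel) => {
      const images = Array.from(document.querySelectorAll(sel));
      return images.length > 0 && images.every((img) => img.src !== "");
    },
    { timeout },
    selector
  );

  // Read the now-populated src properties.
  return page.$$eval(selector, (images) => images.map((img) => img.src));
}
```

It would be called after `page.setViewport(...)` and `page.goto(...)`, for example `const imgs = await waitForLazyImages(page);`, inside a try/catch so that a timeout is reported instead of crashing the script.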