How to scrape image src the right way using puppeteer?

Question

I'm trying to create a function that can capture the src attribute of images on a website, but none of the most common ways of doing so are working.

This was my original attempt.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  try {
    await page.setDefaultNavigationTimeout(0);
    await page.waitForTimeout(500);

    await page.goto(
      `https://www.sirved.com/restaurant/essex-ontario-canada/dairy-freez/1/menus/3413654`,
      {
        waitUntil: "domcontentloaded",
      }
    );

    const fetchImgSrc = await page.evaluate(() => {
      const img = document.querySelectorAll(
        "#menus > div.tab-content >div > div > div.swiper-wrapper > div.swiper-slide > img"
      );
      let src = [];
      for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
      }
      return src;
    });

    console.log(fetchImgSrc);
  } catch (err) {
    console.log(err);
  }

  await browser.close();
})();

[];

In my next attempt I tried a suggestion and was returned an empty string.

    await page.setViewport({ width: 1024, height: 768 });
    
    const imgs = await page.$$eval("#menus img", (images) =>
      images.map((i) => i.src)
    );

    console.log(imgs);

And in my final attempt I followed another suggestion and was returned an array with two empty strings inside of it.

    const fetchImgSrc = await page.evaluate(() => {
      const img = document.querySelectorAll(".swiper-lazy-loaded");
      let src = [];
      for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
      }
      return src;
    });

    console.log(fetchImgSrc);

In each attempt I only replaced the function and console.log portion of the code. I've done a lot of digging and found that these are the most common ways of scraping an image src using puppeteer. I've used them successfully elsewhere, but for some reason they aren't working for me on this page. I'm not sure whether I have a bug in my code or whether something else is stopping it from working.

https://www.sirved.com/restaurant/essex-ontario-canada/dairy-freez/1/menus — Everett De Leon, Jul 31 '22 at 15:55
It does return you the source for the first element, can you tell us what exactly you are trying to do here, so that we can code accordingly — Himanshu Poddar, Jul 31 '22 at 16:00
I'm trying to return the src link for the two menu images on this page. But I'm not getting either returned for me in my code — Everett De Leon, Jul 31 '22 at 16:01
Hi @EverettDeLeon, can you try my answer and let me know if it works for you? — Himanshu Poddar, Jul 31 '22 at 17:25

score 1 · Answer 1 · answered Jul 31 '22 at 17:25

To return the src link for the two menu images on this page you can use

const fetchImgSrc = await page.evaluate(() => {
    const img = document.querySelectorAll('.swiper-lazy-loaded');
    let src = [];
    for (let i = 0; i < img.length; i++) {
       src.push(img[i].getAttribute("src"));
    }
    return src;
});

This gives us the expected output

['https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg', 'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg']

answered Jul 31 '22 at 17:25

Himanshu Poddar

I tried that and had no luck – Everett De Leon Aug 01 '22 at 03:53
What output did you get when u tried this? – Himanshu Poddar Aug 01 '22 at 07:03
[" "] this is what it returns. – Everett De Leon Aug 01 '22 at 12:32

score 1 · Answer 2 · answered Jul 31 '22 at 17:33

You have two issues here:

1. By default, Puppeteer opens the page in a small viewport, and the images to be scraped are lazy loaded: while they are outside the viewport they are not loaded (they do not even have a src yet). You need to give the page a bigger viewport with page.setViewport.
2. Element.getAttribute is not advised if you are working with dynamically changing websites: it returns the attribute value as written on the element, which is an empty string for a lazy loaded image that has not been filled in yet. What you need is the src property, which reflects the current state of the DOM. This is the attribute vs. property distinction in JavaScript.
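To see which of the two issues you are hitting, it can help to print the attribute and the property side by side for every image. This is only a diagnostic sketch: describeImages is a made-up helper name, and the data-src lookup is an assumption about where a lazy loader might keep the real URL, not something confirmed for this site.

```javascript
// Hypothetical helper, meant to be passed to page.$$eval so that it runs
// in the browser and receives the matched <img> elements as an array.
function describeImages(images) {
  return images.map((img) => ({
    // Raw attribute text as currently set on the element (may be null or "").
    attribute: img.getAttribute("src"),
    // Property value: the URL as the browser resolves it.
    property: img.src,
    // Assumption: some lazy loaders keep the real URL in data-src until load.
    dataSrc: img.getAttribute("data-src"),
  }));
}

// Usage inside an async function that already has a Puppeteer `page`:
//   const report = await page.$$eval("#menus img", describeImages);
//   console.log(report);
```

If every row shows an empty attribute and an empty property, the images have most likely not been lazy loaded yet, and the viewport or scrolling is the thing to fix rather than the selector.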

By the way: you can shorten your script with page.$$eval like this:

await page.setViewport({ width: 1024, height: 768 })
const imgs = await page.$$eval('#menus img', images => images.map(i => i.src))
console.log(imgs)

Output:

[
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg',
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg'
]

answered Jul 31 '22 at 17:33

theDavidBarton

This is a unique attempt but it still did not work for me. – Everett De Leon Aug 01 '22 at 03:53
I see now that you are in headless mode, where the lazy loading indeed needs further triggers (e.g. scrolling down to the container). I will take a look later. Your code is not buggy; it will do what you wanted (if you add the bigger viewport). – theDavidBarton Aug 01 '22 at 08:14
If I ran the browser with a head, you're saying your current function works? I don't mind doing it either way. – Everett De Leon Aug 01 '22 at 15:40
It actually scraped both images earlier. Since today it only returns the first image's src, so I think my solution is still not the most stable one at this point. I will try to fine-tune it :) – theDavidBarton Aug 01 '22 at 16:42
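
The scrolling trigger mentioned in the comments above could be sketched roughly like this. It has not been verified against this site: the function names, the "#menus img" default selector, the viewport size and the 10 second timeout are all assumptions, and scrolling the first image into view may or may not be enough to make the slider load every slide.

```javascript
// Predicate evaluated inside the page by page.waitForFunction.
// True once at least one image matches and every match has a non-empty src.
function lazyImagesLoaded(selector) {
  const images = Array.from(document.querySelectorAll(selector));
  return images.length > 0 && images.every((img) => Boolean(img.src));
}

// `page` is a Puppeteer Page that has already navigated to the menu URL.
// The selector and timeout defaults are illustrative assumptions.
async function collectLazyImageSrcs(page, selector = "#menus img", timeout = 10000) {
  // A larger viewport so that more of the page counts as visible.
  await page.setViewport({ width: 1280, height: 1024 });
  // Wait until at least one matching element exists in the DOM.
  await page.waitForSelector(selector, { timeout });
  // Scroll the first match into view to nudge the lazy loader.
  await page.$eval(selector, (el) => el.scrollIntoView());
  // Poll inside the page until every matched image has a src.
  await page.waitForFunction(lazyImagesLoaded, { timeout }, selector);
  // Read the src property (not the attribute) of each image.
  return page.$$eval(selector, (images) => images.map((img) => img.src));
}
```

In the original script this would be called after page.goto, for example const srcs = await collectLazyImageSrcs(page). If the images never receive a src, page.waitForFunction rejects with a timeout error instead of silently returning empty strings, which at least makes the failure visible.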
