I have a simple Puppeteer script that scrapes an announcements website. I need to get the content of the page. After inspecting the DOM, I can see that every div containing the link and the text has the same class. How can I get the contents of each div with a loop?

This is an example of the page's HTML structure. There are about twenty-five divs with the same class, and each one is an announcement.

<div class="container">
 <div class="item-card bordertop show-in-related-free-list">
<!-- link and text are nested inside here -->
 </div>
</div>

This is the JS code I have at the moment. I created it using the headless-recorder-v2 Chrome extension.

const puppeteer = require('puppeteer');
const browser = await puppeteer.launch({
    headless: false,
    slowMo: 300
})
const page = await browser.newPage()
const navigationPromise = page.waitForNavigation()

await page.goto('https://city.example.com/')

await page.setViewport({ width: 1280, height: 607 })

await page.waitForSelector('.bakec > #app > .alert > .btn')
await page.click('.bakec > #app > .alert > .btn')

await page.waitForSelector('.row > .col-md-4:nth-child(1) > .card > .cursor-pointer > .card-title-home')
await page.click('.row > .col-md-4:nth-child(1) > .card > .cursor-pointer > .card-title-home')

await navigationPromise

await page.waitForSelector('#lightbox-vm18 > .modal-dialog > .modal-content > .modal-footer > .btn-primary')
await page.click('#lightbox-vm18 > .modal-dialog > .modal-content > .modal-footer > .btn-primary')

await page.waitForSelector('.bakec > #app > main > .container')
await page.click('.bakec > #app > main > .container')

await page.waitForSelector('#app > main > .container > .item-card:nth-child(3) > .item-container')
// Here I want to loop over announces and store each link and title inside an array

//await page.click('#app > main > .container > .item-card:nth-child(3) > .item-container')

//await navigationPromise

//await browser.close()

UPDATE

I've added these lines of code to my script. I'm able to get an array of the desired elements, but how can I loop over them? Will a forEach loop do the trick, or do I need to use a for loop?

const nodes = await page.$$('.item-heading > .item-title > a')
const announces = []
nodes.forEach( (el) => {
    let href = el.getProperty('href')
    announces.push(href)
})
console.log(announces);

This is the kind of array I get when I loop over the nodes variable:

[
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }, Promise { <pending> },
  Promise { <pending> }
]

2 Answers

You can use page.$$(selector) to get all the elements that match a given CSS selector.

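Then loop over the elements and read the innerHTML property of each one with elementHandle.getProperty(propertyName) to get the content of each div. getProperty returns a promise for a handle, so await it and call jsonValue() on the result to obtain the actual string.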
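
A minimal sketch of that loop, assuming `page` is a Puppeteer page that has already navigated to the listing. The helper name `collectProperty` is made up for illustration and is not part of Puppeteer:

```javascript
// Hypothetical helper: reads one DOM property from every element matching
// `selector`. `page` is assumed to be a Puppeteer Page that has already loaded.
async function collectProperty(page, selector, propertyName) {
  const handles = await page.$$(selector); // array of ElementHandles
  const values = [];
  // for...of lets each await finish before the next iteration starts.
  for (const handle of handles) {
    const propertyHandle = await handle.getProperty(propertyName); // a JSHandle
    values.push(await propertyHandle.jsonValue()); // unwrap to a plain value
  }
  return values;
}
```

With the markup from the question, it could be called as `await collectProperty(page, ".item-card", "innerHTML")`.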

Ben

el.getProperty returns a promise that you'll need to await. You could use console.log(await Promise.all(announces)) to await them all in parallel, or write a for...of loop to await them sequentially. See "Using async/await with a forEach loop" for details.
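
To make the difference concrete, here is a small sketch that runs without a browser. `fakeHandle` is a stand-in written for this example, not a Puppeteer API; it only mimics the fact that `getProperty` and `jsonValue` are both async. Each resolved property is still a handle, so the sketch assumes `jsonValue()` as the unwrapping step:

```javascript
// Stand-in for an ElementHandle (NOT Puppeteer itself): getProperty() is async
// and resolves to an object whose jsonValue() is also async.
const fakeHandle = href => ({
  getProperty: async name => ({
    jsonValue: async () => (name === "href" ? href : undefined),
  }),
});

// Mirrors the question: forEach does not wait, so promises are collected.
function collectWithForEach(handles) {
  const announces = [];
  handles.forEach(el => announces.push(el.getProperty("href")));
  return announces; // an array of Promises, not values
}

// Fix: await every promise together, then unwrap each resolved handle.
async function collectHrefs(handles) {
  const props = await Promise.all(collectWithForEach(handles));
  return Promise.all(props.map(prop => prop.jsonValue()));
}
```

`Promise.all` preserves input order, so the resulting hrefs line up with the order of the matched elements.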

Generally speaking, though, avoid element handles unless you need to dispatch trusted events on an element. They're inherently racy and harder to work with than evaluate-family calls.

Here's an example of getting text from multiple elements with page.$$eval:

const puppeteer = require("puppeteer"); // ^19.7.2

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://quotes.toscrape.com";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const text = await page.$$eval(
    ".quote .text",
    els => els.map(el => el.textContent)
  );
  console.log(text);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

This works for any property. If you're looking for hrefs, you can replace .textContent with .href (the resolved absolute URL) or .getAttribute("href") (the raw attribute value):

const hrefs = await page.$$eval(".tag", els => els.map(el => el.href));

Don't forget to waitForSelector if the elements are added by JS after the page load.
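
Applied to the question, a sketch might look like the following. The selector is the one from the question's update; the helper name `scrapeAnnouncements` and the `{ href, title }` shape are assumptions made for illustration, and `page` is assumed to be a Puppeteer page already on the listing:

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
async function scrapeAnnouncements(
  page,
  selector = ".item-heading > .item-title > a"
) {
  // Wait until at least one match exists, in case the list is rendered by JS.
  await page.waitForSelector(selector);
  // The callback runs in the browser, so `els` are plain DOM elements and
  // the returned objects are serialized back to Node.
  return page.$$eval(selector, els =>
    els.map(el => ({ href: el.href, title: el.textContent.trim() }))
  );
}
```

The result is a plain array of objects, so no promises or handles are left to unwrap afterwards.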


Further remarks on your code:

  • This code

    const navigationPromise = page.waitForNavigation()
    

    looks problematic. goto already waits for a navigation, so this seems either superfluous or potentially buggy. Create a new navigation promise alongside each click that triggers a navigation, not before a goto. Awaiting the same navigation promise multiple times probably doesn't do what you think it does; treat each one as one-shot.

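  • Avoid devtools-generated selectors. Long child chains such as '.row > .col-md-4:nth-child(1) > .card > .cursor-pointer > .card-title-home' are overly specific and break as soon as the markup shifts.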

  • Close the browser resource with a finally block so that proper cleanup occurs in the presence of an error.
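
As a rough sketch of the navigation remark, a click that triggers a page load can be paired with a fresh waitForNavigation call. The helper name `clickAndWaitForNavigation` is hypothetical, and `page` is assumed to be a Puppeteer page:

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
async function clickAndWaitForNavigation(page, selector) {
  await page.waitForSelector(selector);
  // Start listening for the navigation before the click fires, then await
  // both together so the navigation cannot be missed.
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click(selector),
  ]);
  return response;
}
```

Each call creates its own navigation promise, which keeps them one-shot as described above.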

ggorlen