
I am making a web scraper in JavaScript (Node) using Puppeteer. I want to retrieve the text of an element.

The selector was copied and pasted from the Chrome dev tools. When I launch Puppeteer with headless:false, the site loads correctly.

`waitForSelector()` always fails with this error message:

UnhandledPromiseRejectionWarning: TimeoutError: waiting for selector `#petrolTable_data > tr:nth-child(3) > td:nth-child(2)` failed: timeout 30000ms exceeded

This is my code:

const puppeteer = require('puppeteer')

async function scrape(){
    const browser = await puppeteer.launch({headless:false})
    const page = await browser.newPage()

    await page.goto('https://economie.fgov.be/nl/themas/energie/energieprijzen/maximumprijzen/officieel-tarief-van-de', 
        {waitUntil: 'networkidle2'})
    await page.click('#fedconsent > div.orejime-AppContainer > div > div > div > button')
    //await page.screenshot({ path: 'screenshot.png' })
    //#petrolTable_data > tr:nth-child(3) > td:nth-child(2)
    await page.waitForSelector('#petrolTable_data > tr:nth-child(3) > td:nth-child(2)')
    let el = await page.$("#petrolTable_data > tr:nth-child(3) > td:nth-child(2)")
    console.log(el)
    let text = await el.getProperty('textContent')
    console.log(text)
    browser.close()
}

scrape()
Bjop
  • What data are you trying to get? Do you realize the table is in an iframe, `src="https://petrolprices.economie.fgov.be/petrolprices?locale=nl"`? – ggorlen Jul 21 '22 at 20:24
  • I did, but I didn't know this made a difference. Thank you for pointing out that it can be an advantage. – Bjop Jul 22 '22 at 06:52

1 Answer


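The data you want is in an iframe, so you'd have to locate the frame first, then dip into it and query its contents. A selector run on the page itself never sees elements inside the frame, which is why `waitForSelector()` times out. If you open the element inspector in dev tools, iframe contents become unnaturally selectable from the console, so the copied selector appears to work there. Assuming that the console translates 1:1 with Puppeteer is a common gotcha.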
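If you do want to stay on the original page, here is a minimal sketch of the "locate the frame, then query inside it" approach. The helper names `findFrameByUrl` and `readFrameText` are made up for illustration; the only Puppeteer calls used are `page.frames()`, `frame.url()`, `frame.waitForSelector()` and `frame.$eval()`. It assumes the iframe has already been attached when the helper runs, which may not hold on every page.

```javascript
// Sketch only: `findFrameByUrl` and `readFrameText` are hypothetical helpers,
// not part of Puppeteer. They take a Puppeteer `page` as an argument.

// Return the first frame whose URL contains `urlPart`, or undefined.
function findFrameByUrl(page, urlPart) {
  return page.frames().find(frame => frame.url().includes(urlPart));
}

// Wait for `selector` inside the matching frame and return its trimmed text.
async function readFrameText(page, urlPart, selector) {
  const frame = findFrameByUrl(page, urlPart);
  if (!frame) {
    throw new Error(`No frame with a URL containing "${urlPart}"`);
  }
  // The wait and the query both run in the frame's document, not the page's.
  await frame.waitForSelector(selector);
  return frame.$eval(selector, el => el.textContent.trim());
}
```

In the script from the question, this could be called after the consent click as `await readFrameText(page, "petrolprices.economie.fgov.be", "#petrolTable_data > tr:nth-child(3) > td:nth-child(2)")`, reusing the same selector but evaluating it inside the frame.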

But an easier approach is to simply navigate directly to the frame source. This is faster, less work and more reliable, assuming the source won't change.

const puppeteer = require("puppeteer"); // ^19.6.3

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url =
    "https://petrolprices.economie.fgov.be/petrolprices/?locale=nl";
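  // Let only the document request through and abort everything else
  // (scripts, styles, images); the table is already in the static HTML.
  await page.setRequestInterception(true);
  page.on("request", request => {
    request.url() === url ? request.continue() : request.abort();
  });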
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const data = await page.$$eval("tr", els =>
    els
      .slice(1)
      .map(e =>
        [...e.querySelectorAll("td")]
          .slice(0, 2)
          .map(e => e.textContent),
      ),
  );
  console.table(data);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

┌─────────┬──────────────────────────────────────────────┬──────────────────┐
│ (index) │                      0                       │        1         │
├─────────┼──────────────────────────────────────────────┼──────────────────┤
│    0    │             'Benzine 95 RON E10'             │ '1,9310  euro/l' │
│    1    │             'Benzine 98 RON E5'              │ '2,1670  euro/l' │
│    2    │                 'Diesel B7'                  │ '2,0790  euro/l' │
│    3    │ 'Gasolie verwarming 50S (minder dan 2000 l)' │ '1,2923  euro/l' │
│    4    │   'Gasolie verwarming 50S (vanaf 2000 l)'    │ '1,2605  euro/l' │
└─────────┴──────────────────────────────────────────────┴──────────────────┘

Since the data is statically available, even easier and faster is to skip Puppeteer completely and use fetch/cheerio:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const url =
  "https://petrolprices.economie.fgov.be/petrolprices/?locale=nl";
fetch(url)
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    const rows = [...$("table tr:has(td)")].map(e =>
      [...$(e).find("td:not(:last-child)")].map(e =>
        $(e).text().trim(),
      ),
    );
    console.table(rows);
  });

On my slow netbook with both scripts in the cache, Cheerio takes 2 seconds versus 6 seconds for Puppeteer.

If you don't have Node 18 or later, which provides a global fetch, install node-fetch or use Axios.

Generally speaking, I'm not a fan of browser-generated selectors because they're extremely sensitive; if one element changes unexpectedly in the chain, everything breaks. There are almost always more robust selectors you can choose. There are a few other antipatterns in your code, so I'll defer to a blog post of mine for elaboration if you're curious.
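As one illustration of a less position-dependent lookup, here is a small sketch that works on the two-column rows shown in the output above and finds a price by fuel name instead of by `nth-child` index. The helper `priceFor` is hypothetical, and the parsing assumes the price cell keeps the `'2,0790  euro/l'` shape with a decimal comma.

```javascript
// Hypothetical helper: look up a row by its label instead of its position.
// `rows` is an array of [name, price] string pairs, as printed by console.table.
function priceFor(rows, fuelName) {
  const row = rows.find(([name]) => name === fuelName);
  if (!row) {
    return undefined; // label not present (renamed or removed upstream)
  }
  // "2,0790  euro/l" -> 2.079 (assumes a decimal comma and a trailing unit)
  return parseFloat(row[1].replace(",", "."));
}
```

If the site reorders or inserts rows, `priceFor(rows, "Diesel B7")` still returns the right price, and a renamed label shows up as `undefined` instead of silently returning a different fuel's price.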

ggorlen