
I am building a web scraper for SPAs. It works on some sites but runs into errors on others. I am not sure whether my target selector is sometimes wrong, whether requests are throttled because of the large data sets returned, or whether there are too many requests from the same IP. However, I request less data than the page's regular API queries do.

I am using Node.js, Puppeteer, and Express. As I said, it works on some web pages, especially those with smaller data sets.

The error handler reports "Error while fetching {my request} timeout 30000ms exceeded error".

Following the troubleshooting advice in other answers, I set the timeout parameter to zero:

await page.goto('https://www.autoscout24.ch/de/autos/bmw--3-series?yearfrom=2006&priceto=5000&make=9&model=46&vehtyp=10', {timeout: 0});

I have searched for solutions on SO and on blogs, but none of them worked for me.

The target URL is https://www.autoscout24.ch/de/autos/bmw--3-series?yearfrom=2006&priceto=5000&make=9&model=46&vehtyp=10 , where I simply want the price values. Maybe my Puppeteer selector is wrong; the price is far down the DOM tree, see the image:

Screenshot console

The full code is this:

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();
const path = require('path');
const router = express.Router();

app.set('view engine', 'ejs');



(async () => {
  try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const navigationPromise = page.waitForNavigation();

    await page.goto('https://www.autoscout24.ch/de/autos/bmw--3-series?yearfrom=2006&priceto=5000&make=9&model=46&vehtyp=10', {timeout: 0});

    await page.setViewport({ width: 1440, height: 744 });
    await navigationPromise;

    await page.waitForSelector('d-inline-block font-weight-bold h2 mr-2 mb-0');
    let CarPrices = await page.$$eval('d-inline-block font-weight-bold h2 mr-2 mb-0', price => {
      return price.map(prices => prices.innerText);
    });

    console.log(`Prices are: ${CarPrices.join(', ')}`);
    await browser.close();
    app.get('/', function(req, res) {
      res.render('pages/index', { key: `The prices for these cars are: ${CarPrices.join(', ')}` });
    });

    app.use('/', router);
    app.listen(process.env.port || 3000);

    console.log('Running at Port 3000');
  } catch (e) {
    console.log(`Error while fetching prices ${e.message}`);
  }
})();

Since I suspected that the selector might be wrong, I ran the Headless Recorder extension. It suggested that I need a click event on the first page to get access to the selector, so I updated the relevant code to this:

await page.waitForSelector('.vehicle-card-list > .vehicle-card > #analytics-intersect-observer-808 > #analytics-intersect-observer-807 > .base-nav-link');
await page.click('.vehicle-card-list > .vehicle-card > #analytics-intersect-observer-808 > #analytics-intersect-observer-807 > .base-nav-link');

await page.waitForSelector('.title-price-bar > .d-flex > div > .d-flex > .h1');
let CarPrices = await page.$$eval('.title-price-bar > .d-flex > div > .d-flex > .h1', price => {
  return price.map(prices => prices.innerText);
});

I am still getting the timeout error, no matter which selectors I choose.

ptts
  • 1
    `const navigationPromise = page.waitForNavigation();` is redundant and might cause a race condition. `page.goto` already waits for navigation; just use that. I only see one page here -- is this one failing sometimes? – ggorlen Apr 03 '22 at 23:12
  • @ggorlen I have removed that, and consequently // await navigationPromise; as well; I thought it was necessary boilerplate. Another question: with Puppeteer, how do I target an element with multiple classes if I have no more accurate selector? "d-inline-block, font-weight-bold, h2" or ".d-inline-block, .font-weight-bold, .h2" or ".d-inline-block .font-weight-bold .h2" or just "d-inline-block font-weight-bold h2"? The documentation is a bit thin on this. Thanks a lot, btw. And indeed, it is just one page, but I would like all results from the SPA, which should be around 80 results. – ptts Apr 04 '22 at 04:43
  • @ggorlen I have updated the code a bit and tried other selectors and a click event to get to another page, but no luck. – ptts Apr 04 '22 at 06:41
  • 1
    I took a closer look. Looks like you're being intercepted by a captcha. You already figured it out, but `d-inline-block font-weight-bold h2 mr-2 mb-0` isn't a useful selector. That's specifying tag names, not classes. Classes need a `.` prefix, like `.foo`, to work in a selector. – ggorlen Apr 04 '22 at 17:53
  • @ggorlen Thank you a lot. I thought there must be some throttling or rate limiting. I watched some tutorials and read the docs; can you give me a hint about what a useful selector would be? Something like "div>.myClass" etc.? And if it is a captcha, I suppose there's no easy way to get around that? – ptts Apr 04 '22 at 22:02
  • 1
    For the captcha, see [How to deal with the captcha when doing Web Scraping in Puppeteer?](https://stackoverflow.com/questions/55493536/how-to-deal-with-the-captcha-when-doing-web-scraping-in-puppeteer) – ggorlen Apr 04 '22 at 22:06
  • @ggorlen Indeed, I ran it with headless set to false and saw the issue. It is much worse than just a captcha: it says to contact customer service if the captcha is not visible (it is not). Since the reCAPTCHA tool costs money, I will not bother. I tried delaying the requests too. It was a good learning experience, and I learned precisely how to use proper selectors. Thanks again; I would set your reply as the accepted answer. – ptts Apr 07 '22 at 21:04
  • 1
    Thanks, but I'd just call it a dupe of the above link regarding captcha since I can't really solve the problem. Glad it was educational. – ggorlen Apr 07 '22 at 21:21
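
To make the selector point from the comments concrete, here is a minimal sketch. An element that carries several classes at once is matched by chaining the class names with no spaces, each with a `.` prefix. By contrast, `.a .b` matches a `.b` nested inside an `.a`, `.a, .b` matches elements with either class, and `a b` is read as tag names. The helper name `classAttrToSelector` is hypothetical (it is not part of Puppeteer), and it assumes class names without characters that need CSS escaping.

```javascript
// Hypothetical helper: turn the value of an element's class attribute
// into a compound CSS class selector (all classes on the same element).
function classAttrToSelector(classAttr) {
  return classAttr
    .trim()
    .split(/\s+/)               // class names are separated by whitespace
    .filter(Boolean)            // drop the empty entry produced by an empty string
    .map((name) => `.${name}`)  // each class name needs a "." prefix
    .join('');                  // no spaces: every class must be on the same element
}

const priceSelector = classAttrToSelector('d-inline-block font-weight-bold h2 mr-2 mb-0');
console.log(priceSelector); // .d-inline-block.font-weight-bold.h2.mr-2.mb-0

// With Puppeteer, the result would be used like this:
//   await page.waitForSelector(priceSelector);
//   const prices = await page.$$eval(priceSelector, els => els.map(el => el.innerText));
```

A selector built from utility classes like these is brittle, so a shorter one (for example only two of the classes) is often enough, provided it still matches only the price elements.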

1 Answer

Set `waitUntil` and `timeout` in your page.goto call, like this: await page.goto('url'+tableCell04Val, {waitUntil: 'load', timeout: 0});

Try this.

  • This does not work either; it fails before the first click event, on the main overview page. – ptts Apr 04 '22 at 10:45
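
Note that `timeout: 0` on `page.goto` only disables the navigation timeout. `page.waitForSelector` keeps its own default of 30000 ms, which is likely where the reported error comes from when the selector never matches. The sketch below is one possible way to make such a failure easier to diagnose, not a fix for a captcha or block page. The function name `scrapeTexts` and its parameters are hypothetical, and `page` is assumed to be a Puppeteer `Page`.

```javascript
// Sketch: collect the text of all elements matching `selector`.
// If the selector never appears, report which page was actually reached,
// which helps reveal a captcha or block page. `scrapeTexts` is a hypothetical name.
async function scrapeTexts(page, url, selector, timeoutMs = 15000) {
  // page.goto already waits for the navigation, so no extra waitForNavigation is needed.
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });

  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (err) {
    const title = await page.title();
    throw new Error(
      `"${selector}" not found at ${page.url()} (title: "${title}"): ${err.message}`
    );
  }

  // Runs in the browser context and returns plain strings to Node.
  return page.$$eval(selector, (els) => els.map((el) => el.innerText.trim()));
}

// Possible usage inside an async function (assumes puppeteer is installed):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch({ headless: false }); // watch what is rendered
//   const page = await browser.newPage();
//   const prices = await scrapeTexts(page, targetUrl, '.d-inline-block.font-weight-bold.h2');
//   await browser.close();
```

Launching with `headless: false`, as mentioned in the comments on the question, shows whether the site serves a captcha instead of the listing.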