I am trying to scrape this site: https://www.trafikverket.se/trafikinformation/tag/?Station=Stockholm%20C&ArrDep=departure
(I know I could use their API instead, but it's horrendous, and regardless, I wanted to learn more about Puppeteer and AWS.)

This site fetches its data with its own client-side requests after the initial page load, so you have to wait for those to finish before scraping. I tried the options waitUntil: 'networkidle0' and waitUntil: 'networkidle2'. When developing on my machine, the latter worked for me.

This is the code:

import cheerio from 'cheerio'
import chromium from 'chrome-aws-lambda'

const SITE_URL =
  'https://www.trafikverket.se/trafikinformation/tag/?Station=Stockholm%20C&ArrDep=departure'

export async function handler(event) {
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
    ignoreHTTPSErrors: true,
  })
  const page = await browser.newPage()

  await page.goto(SITE_URL, { waitUntil: 'networkidle2', timeout: 120000 })

  const html = await page.content()
  const $ = cheerio.load(html)
  const $trs = $('tr td .time-strikethrough').parent().parent()
  
  console.log($trs)
  // I scrape some HTML here

  // Await the close so Chromium shuts down before the handler returns
  await browser.close()
  return {
    statusCode: 200,
    body: '',
  }
}

I am using Node 12.4.0 locally and am using chrome-aws-lambda@10.1.0 and puppeteer-core@10.0.0.

The problem appears when I run this code on AWS Lambda (with Node 12.x): I keep getting timeout errors. Although the code runs in only 1-2 seconds on my machine, it still hasn't finished after 2 minutes on AWS. I had to add the timeout: 120000 (ms) option because the default Puppeteer timeout is 30 seconds.

I have tried both networkidle0 and networkidle2 but neither worked.


I know the waitUntil option is what's causing the problem because I tested scraping a static site, and that worked flawlessly:

await page.goto('https://lumtest.com/myip.json')
const body = await page.$eval('body', (element) => element.textContent)
console.log('body', body)
Nermin
You're probably being detected as a bot and blocked, if the behavior is only occurring on a specific site. Did you try [changing the user-agent header](https://stackoverflow.com/a/70936552/6243352)? If you're waiting for data, you can do it much more precisely with `waitForResponse` or `waitForSelector` rather than networkidle. As an aside, it's [odd to use Cheerio with Puppeteer](https://serpapi.com/blog/puppeteer-antipatterns/#using-a-separate-html-parser-with-puppeteer) since Puppeteer can already select data on the real page. What's your motivation for that extra step/dependency? – ggorlen May 07 '23 at 15:26
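
The suggestions in the comment above could be sketched roughly as follows. This is a minimal, untested-on-Lambda sketch, not a confirmed fix. `loadRows` and its option names are hypothetical. The default selector is taken from the question's code and is assumed to appear once the data has rendered. Whether a custom user agent is needed at all is also an assumption. The function takes an already opened Puppeteer `page`, waits for a concrete element instead of network idle, and reads the rows in the page itself rather than through Cheerio.

```javascript
// Sketch: wait for a specific element instead of network idle.
// `loadRows` and its option names are hypothetical, not part of Puppeteer.
async function loadRows(page, url, options = {}) {
  const {
    selector = 'tr td .time-strikethrough', // from the question; adjust as needed
    userAgent = null, // optional override, only useful if bot detection is suspected
    timeout = 30000, // milliseconds, applied to both navigation and the wait
  } = options

  if (userAgent) {
    await page.setUserAgent(userAgent)
  }

  // Resolve once the initial HTML is parsed; do not wait for the network to go quiet.
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout })

  // Block until the client-side fetch has rendered at least one matching element.
  await page.waitForSelector(selector, { timeout })

  // Select the data in the page itself, so no separate HTML parser is required.
  return page.$$eval(selector, (elements) => {
    const rows = elements
      .map((element) => element.closest('tr'))
      .filter((row) => row !== null)
    // Several matches can share one row, so keep each row only once.
    return [...new Set(rows)].map((row) => row.textContent.trim())
  })
}
```

Inside the handler this would be called as `const rows = await loadRows(page, SITE_URL)` after `browser.newPage()`. If the chosen selector only exists in some situations (for example, only when a train is delayed), `waitForSelector` will time out, so a selector that always renders once the data arrives is the safer choice.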
