I am trying to scrape this site:
https://www.trafikverket.se/trafikinformation/tag/?Station=Stockholm%20C&ArrDep=departure
(I know I could use their API instead, but it's horrendous and, regardless, I wanted to learn more about Puppeteer and AWS.)
This particular site fetches its data on its own after the initial page load, and you have to wait for that to finish before scraping. So I tried the options waitUntil: 'networkidle0' and waitUntil: 'networkidle2'. When developing on my machine, it was the latter option that worked for me.
This is the code:
import cheerio from 'cheerio'
import chromium from 'chrome-aws-lambda'

const SITE_URL =
  'https://www.trafikverket.se/trafikinformation/tag/?Station=Stockholm%20C&ArrDep=departure'

export async function handler(event) {
  // Launch the Chromium build bundled with chrome-aws-lambda
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
    ignoreHTTPSErrors: true,
  })

  const page = await browser.newPage()
  // This is the call that times out on Lambda
  await page.goto(SITE_URL, { waitUntil: 'networkidle2', timeout: 120000 })

  const html = await page.content()
  const $ = cheerio.load(html)
  const $trs = $('tr td .time-strikethrough').parent().parent()
  console.log($trs)
  // I scrape some HTML here

  // Wait for the browser to shut down before the handler returns
  await browser.close()

  return {
    statusCode: 200,
    body: '',
  }
}
I am using Node 12.4.0 locally, with chrome-aws-lambda@10.1.0 and puppeteer-core@10.0.0.
The problem appears when I run this code on AWS Lambda (with Node 12.x): I keep getting timeout errors. Even though the code runs in only 1-2 seconds on my machine, it still hasn't finished after 2 minutes on AWS. I had to add the timeout: 120000 (ms) option because the default Puppeteer timeout is 30 seconds.
I have tried both networkidle0 and networkidle2, but neither worked.
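Since both networkidle options depend on the number of in-flight requests dropping, one way to see what keeps the page busy on Lambda might be to record which requests have started but never finished. This is only a minimal sketch and not part of my handler: trackPendingRequests is a hypothetical helper name, and it relies solely on Puppeteer's page events 'request', 'requestfinished' and 'requestfailed', plus request.url().

```javascript
// Hypothetical helper: remembers requests that have started but have not
// yet finished or failed, so they can be logged if page.goto() times out.
function trackPendingRequests(page) {
  const pending = new Set()

  page.on('request', (request) => pending.add(request))
  page.on('requestfinished', (request) => pending.delete(request))
  page.on('requestfailed', (request) => pending.delete(request))

  // Returns the URLs of the requests still in flight at call time.
  return () => Array.from(pending, (request) => request.url())
}
```

The idea would be to call `const getPending = trackPendingRequests(page)` before `page.goto(...)`, wrap the navigation in a try/catch, and log `getPending()` in the catch to see which URLs never completed.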
The reason I know it's the waitUntil option that is causing problems is that I tested scraping a static site, and that worked flawlessly:
await page.goto('https://lumtest.com/myip.json')
const body = await page.$eval('body', (element) => element.textContent)
console.log('body', body)
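Given that a plain page.goto() works, an alternative to the networkidle options might be to wait for the rendered element itself rather than for the network to go quiet. The following is a minimal sketch under assumptions, not a confirmed fix: loadRenderedHtml is a hypothetical name, and the selector has to be one that is present whenever the data has loaded (the 'tr td' in the usage note is only a guess). It uses page.goto() with waitUntil: 'domcontentloaded', page.waitForSelector() and page.content().

```javascript
// Hypothetical helper: navigates without waiting for network idle, then
// waits until an element matching `selector` exists before reading the HTML.
async function loadRenderedHtml(page, url, selector, timeoutMs = 30000) {
  // Resolve as soon as the initial document is parsed.
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs })

  // Rejects with a timeout error if the element never appears.
  await page.waitForSelector(selector, { timeout: timeoutMs })

  return page.content()
}
```

In the handler this would replace the goto/content pair, for example `const html = await loadRenderedHtml(page, SITE_URL, 'tr td')`. Waiting on '.time-strikethrough' directly would probably be a poor choice, since that element may be absent when the page has nothing struck through.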