
I'm trying to web scrape a press site, open every article link and get the data. I was able to scrape the pages with Puppeteer, but I cannot upload the results to Firebase Cloud Storage. How do I do that every hour or so? I did the scraping in an asynchronous function and then called it in the cloud function: I used Puppeteer to scrape the links of the articles from the newsroom website and then used those links to get more information from the articles. I first had everything in a single async function, but Cloud Functions threw an error that there should not be any awaits in a loop.

UPDATE:

I implemented the code above in a Firebase function but still get the no-await-in-loop error.

DErasmus
  • I think it would be easier if you shared your first version, where you had everything in one function. Note: there are many unused imports (title, link, resolve, reject) at the start of your script. You might want to delete them. – stekhn Oct 06 '20 at 16:34

1 Answer


There are a couple of things wrong here, but you are on a good path to getting this to work. The main problem is the no-await-in-loop error: it does not come from Cloud Functions itself, but from ESLint, which the Firebase CLI runs before every deploy. The rule flags await inside loops because sequential awaits are often an accidental performance problem; in your case, visiting the article pages one after another is intentional. Also note that asynchronous JavaScript has its own way of dealing with errors, see: try/catch blocks with async/await.
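If the linter keeps flagging the loop after you restructure the code, you can tell it that the sequential awaits are intentional. A minimal sketch, assuming your functions folder uses an .eslintrc.js file (the file name and the rest of the config depend on how your project was generated):

// functions/.eslintrc.js (hypothetical config, merge it with what you already have)
module.exports = {
  parserOptions: { ecmaVersion: 2018 },
  rules: {
    // The pages are scraped one after another on purpose, so turn the rule off
    'no-await-in-loop': 'off',
  },
};

Alternatively, keep the rule enabled and put // eslint-disable-next-line no-await-in-loop directly above the await calls inside the loop.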

In your case, it's totally fine to write everything in one async function. Here is how I would do it:

const puppeteer = require('puppeteer');

async function scrapeIfc() {
  const completeData = [];
  const url = 'https://www.ifc.org/wps/wcm/connect/news_ext_content/ifc_external_corporate_site/news+and+events/pressroom/press+releases';

  // Open the press release overview page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.setDefaultNavigationTimeout(0);
  await page.goto(url);

  // Collect the links to the individual articles
  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll('h3 > a')).map(anchor => anchor.href)
  );

  // Visit each article sequentially and extract title, contact and text
  for (const link of links) {
    const newPage = await browser.newPage();
    await newPage.goto(link);

    const data = await newPage.evaluate(() => {
      const titleElement = document.querySelector('td[class="PressTitle"] > h3');
      const contactElement = document.querySelector('center > table > tbody > tr:nth-child(1) > td');
      const txtElement = document.querySelector('center > table > tbody > tr:nth-child(2) > td');

      // Guard against missing elements, as the markup differs between articles
      return {
        source: 'IFC',
        title: titleElement ? titleElement.innerText : undefined,
        contact: contactElement ? contactElement.innerText : undefined,
        txt: txtElement ? txtElement.innerText : undefined,
      };
    });

    completeData.push(data);
    await newPage.close();
  }

  await browser.close();

  return completeData;
}
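
To answer the "every hour or so" part: you can wrap the scraper in a scheduled Cloud Function and write the results from there. Here is a minimal sketch, assuming the Firebase Admin SDK is set up for your project; the function name scrapeNewsroom, the collection name scrapedArticles, the memory and the timeout are placeholders you would adjust:

const functions = require('firebase-functions');
const admin = require('firebase-admin');

admin.initializeApp();

// Cloud Scheduler triggers this function every hour;
// Puppeteer needs more memory and time than the defaults
exports.scrapeNewsroom = functions
  .runWith({ memory: '1GB', timeoutSeconds: 300 })
  .pubsub.schedule('every 60 minutes')
  .onRun(async () => {
    const articles = await scrapeIfc();

    // Write every article as its own Firestore document
    const db = admin.firestore();
    const batch = db.batch();
    articles.forEach(article => {
      batch.set(db.collection('scrapedArticles').doc(), article);
    });
    await batch.commit();

    return null;
  });

If you would rather have the raw result in Cloud Storage, as mentioned in the question, something like admin.storage().bucket().file('latest-scrape.json').save(JSON.stringify(articles)) should work as well, provided a default bucket is configured for your project.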

There are a couple of other things you should note:

  • You have a bunch of unused imports (title, link, resolve and reject) at the head of your script, which might have been added automatically by your code editor. Get rid of them, as they might shadow the real variables.
  • I changed your document.querySelectors to be more specific, as I couldn't select the actual elements on the IFC website. You might need to revise them.
  • For local development I use Google's functions-framework, which helps me run and test the function locally before deploying. If something fails on your local machine, it will also fail when deploying to Google Cloud.
  • (Opinion) If you don't need Firebase, I would run this with Google Cloud Functions, Cloud Scheduler and the Cloud Firestore. For me, this has been the go-to workflow for periodic web scraping.
  • (Opinion) Puppeteer might be overkill for scraping a simple static website, since it runs a headless browser. Something like Cheerio is much more lightweight and much faster; see the sketch below.
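
For comparison, here is a rough sketch of the link extraction with Cheerio instead of Puppeteer. It assumes the press release list is plain server-rendered HTML (if the page builds its content with client-side JavaScript, you still need a headless browser) and uses axios for the HTTP request:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeIfcLinks() {
  // Download the raw HTML; no browser involved
  const response = await axios.get(
    'https://www.ifc.org/wps/wcm/connect/news_ext_content/ifc_external_corporate_site/news+and+events/pressroom/press+releases'
  );

  // Parse it and reuse the same selector as in the Puppeteer version
  const $ = cheerio.load(response.data);
  return $('h3 > a')
    .map((i, anchor) => $(anchor).attr('href'))
    .get();
}

Note that attr('href') returns the raw attribute value, so you might have to prepend the domain for relative links, whereas Puppeteer's anchor.href already resolves them.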

Hope I could help. If you encounter other problems, let us know. Welcome to the Stack Overflow community!

stekhn