
I've been making incremental progress, but I'm fairly stumped at this point.

I'm trying to download data from https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp (the link is "Download Raw Data"). I'm using Puppeteer because I can't find a supported API for this data; if there is one, I'm happy to try it.

My script runs to the end, but doesn't seem to actually download any files. I tried installing puppeteer-extra and setting the downloads path:

const puppeteer = require("puppeteer-extra");
const { executablePath } = require('puppeteer')

...

    var dir = "/home/ubuntu/AirlineStatsFetcher/downloads";
    console.log('dir to set for downloads', dir);
    puppeteer.use(
        require('puppeteer-extra-plugin-user-preferences')({
            userPrefs: {
                download: {
                    prompt_for_download: false,
                    open_pdf_in_system_reader: true,
                    default_directory: dir,
                },
                plugins: {
                    always_open_pdf_externally: true,
                },
            },
        })
    );

    const browser = await puppeteer.launch({
        headless: true, slowMo: 100, executablePath: executablePath()
    });

...
    // Doesn't seem to work
    await page.waitForSelector('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');
    console.log('Clicking on link to download CSV');
    await page.click('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');

After a while I figured, why not try to build the full URL and do a GET request? But then I run into other problems (UNABLE_TO_VERIFY_LEAF_SIGNATURE). Before going further down this route (which feels a little hacky), I wanted to ask for advice here.

Is there something I'm missing in terms of configuration to get it to download?

1 Answer


Downloading files with Puppeteer seems to be a moving target and is not well supported today. For now (Puppeteer 19.2.2), I would go with https.get instead.

"use strict";

const fs = require("fs");
const https = require("https");
// Not sure why puppeteer-extra is used... maybe https://stackoverflow.com/a/73869616/1258111 solves the need in future.
const puppeteer = require("puppeteer-extra");
const { executablePath } = require("puppeteer");

(async () => {
  puppeteer.use(
    require("puppeteer-extra-plugin-user-preferences")({
      userPrefs: {
        download: {
          prompt_for_download: false,
          open_pdf_in_system_reader: false,
        },
        plugins: {
          always_open_pdf_externally: false,
        },
      },
    })
  );

  const browser = await puppeteer.launch({
    headless: true,
    slowMo: 100,
    executablePath: executablePath(),
  });

  const page = await browser.newPage();
  await page.goto(
    "https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp",
    {
      waitUntil: "networkidle2",
    }
  );

  const handle = await page.$(
    "table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)"
  );

  const relativeZipUrl = await page.evaluate(
    (anchor) => anchor.getAttribute("href"),
    handle
  );

  const url = "https://www.transtats.bts.gov/OT_Delay/".concat(relativeZipUrl);
  const encodedUrl = encodeURI(url);

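  // Don't use in production: this disables TLS certificate verification
  // for every HTTPS request made by this process.
  https.globalAgent.options.rejectUnauthorized = false;

  https
    .get(encodedUrl, (res) => {
      const path = `${__dirname}/download.zip`;
      const file = fs.createWriteStream(path);
      // Stream the response body straight to disk.
      res.pipe(file);
      file.on("finish", () => {
        file.close();
        console.log("Download Completed");
      });
    })
    .on("error", (err) => console.error("Download failed:", err.message));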

  await browser.close();
})();
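
If you want to tighten the script above a little, here is a minimal sketch of three optional adjustments. It resolves the href against the page URL with the WHATWG URL class instead of string concatenation, limits the disabled certificate check to a single agent rather than the global one, and wraps the download in a Promise so it can be awaited. The names resolveDownloadUrl, insecureAgent and downloadFile are hypothetical helpers of mine, not part of Puppeteer or Node, and I have not run this against the BTS site.

```javascript
"use strict";

const fs = require("fs");
const https = require("https");

// Hypothetical helper: resolve an href (relative or absolute) against the
// URL of the page it was found on. The URL parser also percent-encodes
// characters such as spaces, so a separate encodeURI call is not needed.
function resolveDownloadUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// Agent that skips certificate verification only for requests that are
// explicitly given this agent. Still not something to use in production.
const insecureAgent = new https.Agent({ rejectUnauthorized: false });

// Hypothetical helper: download `url` to `destPath` and resolve once the
// file has been fully written and closed.
function downloadFile(url, destPath, agent) {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent }, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // discard the body so the socket is released
          reject(new Error(`Unexpected status code ${res.statusCode}`));
          return;
        }
        const file = fs.createWriteStream(destPath);
        res.pipe(file);
        file.on("finish", () => file.close(() => resolve(destPath)));
        file.on("error", reject);
      })
      .on("error", reject);
  });
}
```

Inside the async function this could be used as `await downloadFile(resolveDownloadUrl(relativeZipUrl, page.url()), `${__dirname}/download.zip`, insecureAgent);`, which also makes sure the script does not report success before the file is on disk.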
stefan.seeland
    That did the trick, thank you! And thanks for the insight about the state of file downloading from Puppeteer - weird that such a common action is not well supported. But your solution works in this instance, thanks so much for unblocking – ObjectNameDisplay Nov 18 '22 at 01:45