15

I have a small web scraping application that downloads multiple files from a web application where the URLs require visiting the page.

It works fine if I keep the browser instance alive in between runs, but I want to close the instance in between runs. When I call browser.close() my downloads are stopped because the chrome instance is closed before the downloads have finished.

Does puppeteer provide a way to check if downloads are still active, and wait for them to complete? I've tried page.waitForNavigation({ waitUntil: "networkidle0" }) and "networkidle2", but those seem to wait indefinitely.


  • node.js 8.10
  • puppeteer 1.10.0
Evan Carroll
Jiri
  • I remember doing this once with nightmarejs, I don't know if that's helpful or not. The core team decided it wasn't worth including so someone made an extra called nightmare-download-manager – pguardiario Nov 26 '18 at 08:06
  • Thanks @pguardiario, but that does not help me much, unfortunately. I don't want to switch to nightmare.js. – Jiri Nov 26 '18 at 11:27

10 Answers

4

Update:

It's 2022. Use Playwright to get away from this mess; it can manage downloads.

It also has a 'smarter' locator, which re-examines selectors every time before click().
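
A minimal sketch of what that could look like, assuming `page` is a Playwright Page. The function name `downloadAndWait`, the selector and the target path are hypothetical; `page.waitForEvent('download')`, `download.saveAs()` and `download.suggestedFilename()` are the Playwright calls it relies on.

```javascript
// Sketch only: `page` is expected to be a Playwright Page.
// The selector and target path are placeholders supplied by the caller.
async function downloadAndWait(page, selector, targetPath) {
    // Start listening before the click so the event cannot be missed.
    const downloadPromise = page.waitForEvent('download');
    await page.click(selector);
    const download = await downloadPromise;
    // saveAs() waits for the download to finish before copying the file.
    await download.saveAs(targetPath);
    return download.suggestedFilename();
}
```

Once this resolves, the file is on disk, so closing the browser afterwards should no longer cut the download short.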


Old version, for Puppeteer:

My solution is to use Chrome's own chrome://downloads/ page to manage the downloaded files. This approach also makes it very easy to automatically restart a failed download using Chrome's own retry feature.

This example is currently 'single-threaded', because it only monitors the first item that appears on the download manager page. You can easily adapt it to 'infinite threads' by iterating through all download items (#frb0 to #frbn) on that page. Just take care of your network :)

const dmPage = await browser.newPage()
await dmPage.goto('chrome://downloads/')

await your_download_button.click() // start download

await dmPage.bringToFront() // this is necessary
await dmPage.waitForFunction(
    () => {
        // Monitor the state of the first download item:
        // return true when it has finished; click retry if it has failed.
        const dm = document.querySelector('downloads-manager').shadowRoot
        const firstItem = dm.querySelector('#frb0')
        if (firstItem) {
            const thatArea = firstItem.shadowRoot.querySelector('.controls')
            const atag = thatArea.querySelector('a')
            // '在文件夹中显示' is the Chinese-locale label (roughly "Show in folder").
            // Adjust it for your locale, or match on ids/classes instead.
            if (atag && atag.textContent === '在文件夹中显示') {
                return true
            }
            const btn = thatArea.querySelector('cr-button')
            // '重试' is the Chinese-locale label (roughly "Retry").
            if (btn && btn.textContent === '重试') {
                btn.click()
            }
        }
        return false
    },
    { polling: 'raf', timeout: 0 }, // poll on every animation frame; 'mutation' polling also exists
)
console.log('finish')
TeaDrinker
4

I didn't like solutions that were checking DOM or file system for the file.

From Chrome DevTools Protocol documentation I found two events, Page.downloadProgress and Browser.downloadProgress. (Though Page.downloadProgress is marked as deprecated, that's the one that worked for me.)

This event has a property called state which tells you about the state of the download. state can be inProgress, completed, or canceled.

You can wrap this event in a Promise and await it until the state changes to completed:

async function waitUntilDownload(page, fileName = '') {
    return new Promise((resolve, reject) => {
        // page._client() is a private Puppeteer API and may change between versions.
        page._client().on('Page.downloadProgress', e => { // or 'Browser.downloadProgress'
            if (e.state === 'completed') {
                resolve(fileName);
            } else if (e.state === 'canceled') {
                reject(new Error('Download was canceled'));
            }
        });
    });
}

and await it as follows,

await waitUntilDownload(page, fileName);
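
A variant of the same idea, as a sketch rather than a definitive implementation: it listens for `Browser.downloadProgress` on a CDP session object passed in by the caller, removes its listener when done, and adds a timeout. The function name `waitForDownloadCompleted` and the `timeoutMs` parameter are hypothetical. It assumes `session` has `on`/`off` methods, as a Puppeteer CDPSession does.

```javascript
// Sketch only: `session` is expected to be a Puppeteer CDPSession
// (any object with on/off works). Resolves with the download's guid.
function waitForDownloadCompleted(session, timeoutMs = 60000) {
    return new Promise((resolve, reject) => {
        const cleanup = () => {
            clearTimeout(timer);
            session.off('Browser.downloadProgress', onProgress);
        };
        const onProgress = (event) => {
            if (event.state === 'completed') {
                cleanup();
                resolve(event.guid);
            } else if (event.state === 'canceled') {
                cleanup();
                reject(new Error(`Download ${event.guid} was canceled`));
            }
        };
        // Upper bound so the caller never waits forever.
        const timer = setTimeout(() => {
            cleanup();
            reject(new Error(`No download completed within ${timeoutMs} ms`));
        }, timeoutMs);
        session.on('Browser.downloadProgress', onProgress);
    });
}
```

With Puppeteer, the session would typically come from `page.createCDPSession()` (older versions: `page.target().createCDPSession()`). As far as I can tell from the protocol documentation, the browser only emits these events after `Browser.setDownloadBehavior` has been sent with `eventsEnabled: true`, so treat that step as an assumption to verify against your Chrome version. Start waiting before you click the download link, then await the returned promise.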
B45i
  • That's good, but late. By the way, at the time I was trying to use Page.downloadProgress, it was marked `experimental`. Also, your link is broken. – TeaDrinker Mar 31 '23 at 21:56
  • I've updated the link, but I didn't get what you meant by 'late'. – B45i Apr 01 '23 at 04:48
  • Because `downloadProgress()` is now deprecated, which perhaps won't be supported in the near future I guess? Anyway, I'm confused about why google made this useful API deprecated. – TeaDrinker Apr 01 '23 at 11:53
  • There is now `Browser.downloadProgress`. (also in experimental state) – B45i Apr 03 '23 at 07:58
3

An alternative, if you know the file name (or have another way to check), is to poll the file system until the file exists.


const fs = require('fs');

// Poll every 3 seconds until the file exists.
async function waitFile(filename) {
    while (!fs.existsSync(filename)) {
        await delay(3000);
    }
}

// Resolve after `time` milliseconds.
function delay(time) {
    return new Promise(function(resolve) {
        setTimeout(resolve, time);
    });
}

Implementation:

var filename = `${yyyy}${mm}_TAC.csv`;
var pathWithFilename = `${config.path}\\${filename}`;
await waitFile(pathWithFilename);
Gustave Dupre
3

You need to check the response to the request.

page.on('response', (response) => { console.log(response.status(), response.url()); });

Inspect what comes back in the response and look at its status; a successful download comes back with status 200.
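
As a sketch of how that check could be turned into something awaitable: the helper below resolves when a response whose URL contains a given substring arrives with status 200. The name `waitForDownloadResponse` and the `urlPart` parameter are hypothetical; `response.url()`, `response.status()` and `response.headers()` are Puppeteer's response methods. Note the assumption built into it: a 200 response only shows that the server has started answering, not that the file is fully written to disk.

```javascript
// Sketch only: `page` is expected to be a Puppeteer Page (any object with on/off works).
// `urlPart` is a hypothetical substring that identifies the download URL.
function waitForDownloadResponse(page, urlPart) {
    return new Promise((resolve) => {
        const onResponse = (response) => {
            if (response.url().includes(urlPart) && response.status() === 200) {
                // Stop listening once the matching response has been seen.
                page.off('response', onResponse);
                resolve({
                    url: response.url(),
                    // Often 'attachment; filename=...' for downloads; may be empty.
                    disposition: response.headers()['content-disposition'] || '',
                });
            }
        };
        page.on('response', onResponse);
    });
}
```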

lejlun
2

Using Puppeteer and Chrome, I have one more solution which might help you.

While Chrome is downloading a file, the file always has a ".crdownload" extension. When the file is completely downloaded, that extension vanishes.

So I use a recursive function with a maximum number of iterations. If the file has not finished downloading within that time, I delete it. Meanwhile, I keep checking the folder for that extension.

// Method of a helper class that also defines delay(ms); requires `fs`.
// Checks `path` every 20 s; after `timer` retries, deletes leftover .crdownload files.
async checkFileDownloaded(path, timer) {
    while (true) {
        let files;
        try {
            files = fs.readdirSync(path);
        } catch (err) {
            return "null"; // the directory could not be read
        }
        const pending = files.filter((name) => name.includes('.crdownload'));
        if (pending.length === 0) {
            return "Success"; // nothing is still downloading
        }
        if (timer <= 0) {
            // Out of retries: remove the partial downloads and give up.
            pending.forEach((name) => fs.unlinkSync(path + '/' + name));
            return "Success";
        }
        await this.delay(20000);
        timer = timer - 1;
    }
}
Anand Biradar
2

Here is another function; it just waits for the pause button to disappear:

async function waitForDownload(browser: Browser) {
  const dmPage = await browser.newPage();
  await dmPage.goto("chrome://downloads/");

  await dmPage.bringToFront();
  await dmPage.waitForFunction(() => {
    try {
      const donePath = document.querySelector("downloads-manager")!.shadowRoot!
        .querySelector(
          "#frb0",
        )!.shadowRoot!.querySelector("#pauseOrResume")!;
      if ((donePath as HTMLButtonElement).innerText != "Pause") {
        return true;
      }
    } catch {
      //
    }
  }, { timeout: 0 });
  console.log("Download finished");
}
Zero14
1

I created a simple awaitable function that checks for the file rapidly and times out after 10 seconds.

import fs from "fs";

// Property of a helper object; HelperUI.delay(ms) is the author's own sleep helper.
awaitFileDownloaded: async (filePath) => {
    let timeout = 10000 // total time budget in ms
    const delay = 200   // polling interval in ms

    while (timeout > 0) {
        if (fs.existsSync(filePath)) {
            return true
        }
        await HelperUI.delay(delay)
        timeout -= delay
    }
    throw new Error("awaitFileDownloaded timed out")
},
Delorean
1

You can use node-watch to report updates to the target directory. When the file download is complete, you will receive an update event with the name of the new file that has been downloaded.

Run npm to install node-watch:

npm install node-watch

Sample code:

const puppeteer = require('puppeteer');
const watch = require('node-watch');
const path = require('path');

const watchDir = '/Users/home/Downloads';
const filepath = path.join(watchDir, 'download_file');

(async () => {
    const browser = await puppeteer.launch();
    // Add code to initiate the download ...

    // node-watch reports "update" when a file is created or changed.
    watch(watchDir, function (event, name) {
        if (event === 'update' && name === filepath) {
            browser.close(); // use case specific
            process.exit();  // use case specific
        }
    });
})();
Amitabh
0

I tried await page.waitFor(50000); with a timeout as long as the download should take.

Alternatively, watch for file changes that signal a complete file transfer.

Hellonearthis
  • thanks for that simple solution. `await page.waitFor(timeout)` works, but I'll try and build a more graceful solution inspired by your second suggestion. – Jiri Nov 26 '18 at 12:24
  • I've used your pointer to file events and this answer on [check if file exists, if not wait until it exists](https://stackoverflow.com/questions/26165725/nodejs-check-file-exists-if-not-wait-till-it-exist) to implement a more graceful solution. I couldn't use the answer you pointed to directly, because I'm dealing with local files, not remote servers. But good inspiration nonetheless! If the download has finished, the file already exists. If not, it waits for the temporary file used during download to be renamed to the target file name. – Jiri Nov 27 '18 at 08:39
  • `waitFor` is now deprecated, this shouldn't be the accepted answer anymore: https://github.com/puppeteer/puppeteer/issues/6214 – Mooncake Feb 07 '22 at 16:02
  • Even if you use the correct `waitForTimeout`, this is always a race condition. What if you hit some latency and the download takes 50001 ms instead of 50000? What if you download the whole thing in 1ms, then the script sits pointlessly for 49999 ms? There should be a better solution. – ggorlen Feb 07 '22 at 22:54
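
One way to address the race condition raised in the last comment is to wait on a real completion signal and use the timeout only as an upper bound. A minimal sketch, where `withTimeout` is a hypothetical helper and the promise passed in is whatever completion signal you have (a file check, a download event, and so on):

```javascript
// Sketch only: rejects if `promise` has not settled within `ms` milliseconds,
// otherwise passes its result through as soon as it is available.
function withTimeout(promise, ms, label = 'operation') {
    let timer;
    const deadline = new Promise((_, reject) => {
        timer = setTimeout(
            () => reject(new Error(`${label} timed out after ${ms} ms`)),
            ms,
        );
    });
    // Whichever settles first wins; the timer is always cleared afterwards.
    return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

A fast download then returns immediately instead of sitting out the full 50 seconds, and a slow one fails with a clear error rather than being cut off silently.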
0

You could search the download location for the extension that files have while they are still downloading, 'crdownload'. When the download completes, the file is renamed to its original extension: 'video_audio_file.mp4.crdownload' turns into 'video_audio_file.mp4', without the 'crdownload' at the end.

const fs = require('fs');
const path = require('path');
const myPath = path.resolve('/your/file/download/folder');

// Returns 1 while any 'crdownload' file is present in myPath, otherwise 0.
function stillWorking(myPath) {
    let siNo = 0;
    const filenames = fs.readdirSync(myPath);
    filenames.forEach(file => {
        if (file.includes('crdownload')) {
            siNo = 1;
        }
    });
    return siNo;
}

Then use it in an infinite loop and check at a fixed interval. Here I check every 3 seconds; when it returns 0, there are no pending files left to download.

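const { execSync } = require('child_process');

// Must run inside an async function because of the await below.
while (true) {
    execSync('sleep 3'); // blocks the process for 3 seconds (Unix-like systems only)
    if (stillWorking(myPath) == 0) {
        await browser.close();
        break;
    }
}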
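
A non-blocking variant of the same loop, as a sketch: instead of execSync('sleep 3'), it awaits a timer, so the Node event loop stays free while waiting, and it gives up after a deadline. The function name `waitForNoPartialDownloads` and its parameters are hypothetical.

```javascript
const fs = require('fs');

// Sketch only: resolves to true once `dir` contains no *.crdownload files,
// or to false if that has not happened within `timeoutMs`.
async function waitForNoPartialDownloads(dir, intervalMs = 3000, timeoutMs = 120000) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        const names = await fs.promises.readdir(dir);
        if (!names.some((name) => name.endsWith('.crdownload'))) {
            return true;
        }
        // Awaiting a timer yields to the event loop instead of blocking it.
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    return false;
}
```

One caveat: if this is called before Chrome has created the temporary file, it returns true straight away, so it is safest to call it once the download is known to have started.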