
I've been running the following code in order to download a csv file from the website http://niftyindices.com/resources/holiday-calendar:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({headless: true});
    const page = await browser.newPage();

    await page.goto('http://niftyindices.com/resources/holiday-calendar');
    await page._client.send('Page.setDownloadBehavior', {behavior: 'allow', downloadPath: '/tmp'});
    await page.click('#exportholidaycalender');
    await page.waitFor(5000);
    await browser.close();
})();

With headless: false it works: the file is downloaded into /Users/user/Downloads. With headless: true it does NOT work.

I'm running this on macOS Sierra (MacBook Pro) with Puppeteer version 1.1.1, which pulls Chromium version 66.0.3347.0 into the .local-chromium/ directory. I set the project up with npm init and npm i --save puppeteer.

Any idea what's wrong?

Thanks in advance for your time and help,

Antonio Gomez Alvarado
  • I ran this with `--enable-logging` when creating the `browser` object and I'm seeing this during the download: `[0313/104723.451228:VERBOSE1:navigator_impl.cc(200)] Failed Provisional Load: data:application/csv;charset=utf-8,%22SR.%20NO.... error_description: , showing_repost_interstitial: 0, frame_id: 4` – Antonio Gomez Alvarado Mar 13 '18 at 08:49

9 Answers


I spent hours poring through this thread and Stack Overflow yesterday, trying to figure out how to get Puppeteer to download a CSV file by clicking a download link in headless mode in an authenticated session. The accepted answer here didn't work in my case because the download does not trigger targetcreated, and the next answer, for whatever reason, did not retain the authenticated session. This article saved the day. In short: use fetch from inside the page. Hopefully this helps someone else out.

const res = await this.page.evaluate(() =>
{
    return fetch('https://example.com/path/to/file.csv', {
        method: 'GET',
        credentials: 'include'
    }).then(r => r.text());
});
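
To actually get the file onto disk, the text returned from the page context can then be written out with Node's fs module; a minimal sketch (the path is just an example):

const fs = require('fs');

// `res` holds the CSV text returned by page.evaluate above;
// write it to a local file.
fs.writeFileSync('/tmp/file.csv', res);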
Justin
  • This may work for some downloads, but doesn't work in my case where the server requires a POST request and is careful about not returning contents as a response body, but instead as a file download with type octet-stream. – nurettin Nov 10 '18 at 13:38
  • I was having a problem downloading a large text file (70MB) even with headless `false`. The page would never fully load. Using `fetch` worked like a charm. Thanks! – Jeff Kilbride Sep 03 '21 at 23:38
  • I have to say, thanks! This one really made my day. There are few ways to do this these days. *Note:* this one should be the correct answer if the question is modified to include credentials. – Martin Nov 30 '22 at 23:13

This page downloads a CSV by creating a comma-delimited string and forcing the browser to download it by setting the data type, like so:

let uri = "data:text/csv;charset=utf-8," + encodeURIComponent(content);
window.open(uri, "Some CSV");

On Chrome, this opens a new tab.

You can tap into the targetcreated event and write the contents to a file yourself. Not sure if this is the best way, but it works well.

const puppeteer = require('puppeteer');
const fs = require('fs');

const browser = await puppeteer.launch({
    headless: true
});
browser.on('targetcreated', async (target) => {
    let s = target.url();
    // the test opens an about:blank to start - ignore this
    if (s == 'about:blank') {
        return;
    }
    // remove the content type prefix
    s = s.replace("data:text/csv;charset=utf-8,", "");
    // clean up the string by decoding the %xx escapes
    s = decodeURIComponent(s);
    fs.writeFile("/tmp/download.csv", s, function(err) {
        if (err) {
            console.log(err);
            return;
        }
        console.log("The file was saved!");
    });
});

const page = await browser.newPage();
// ... open link ...
// ... click on download link ...
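
For the page in the question, those last two steps would look roughly like this (the selector comes from the question; the 5-second wait is an arbitrary grace period):

await page.goto('http://niftyindices.com/resources/holiday-calendar');
await page.click('#exportholidaycalender');
// give the targetcreated handler time to write /tmp/download.csv before closing
await page.waitFor(5000);
await browser.close();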
Sumit Mishra

The problem is that the browser closes before the download has finished.

You can get the file size and file name from the response, and then watch the downloaded file until it reaches that size before closing the browser.

This is an example:

    const fs = require('fs');
    const filename = "set this with some regex in response";
    const dir = "watch folder or file";

    // Download and wait for download
    await Promise.all([
        page.click('#DownloadFile'),
        // Event on all responses
        page.on('response', response => {
            // If the response has a file on it
            if (response._headers['content-disposition'] === `attachment;filename=${filename}`) {
                // Get the size from the headers
                console.log('Header size: ', response._headers['content-length']);
                // Watch the download folder or file
                fs.watchFile(dir, function (curr, prev) {
                    // If the current size equals the size from the response, close
                    if (parseInt(curr.size) === parseInt(response._headers['content-length'])) {
                        browser.close();
                        this.close();
                    }
                });
            }
        })
    ]);

Even though the way of matching the response can be improved, I hope you'll find this useful.
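
Concretely, the watch can be wrapped in a promise so the script awaits the download before closing the browser; a sketch of the same idea (filename, downloadedFile and the click selector are placeholders):

const fs = require('fs');

const downloadFinished = new Promise((resolve) => {
    page.on('response', (response) => {
        const disposition = response.headers()['content-disposition'] || '';
        // only react to the response that carries the file
        if (!disposition.includes(filename)) return;
        const expectedSize = parseInt(response.headers()['content-length'], 10);
        fs.watchFile(downloadedFile, (curr) => {
            // resolve once the file on disk reaches the announced size
            if (curr.size === expectedSize) {
                fs.unwatchFile(downloadedFile);
                resolve();
            }
        });
    });
});

await page.click('#DownloadFile');
await downloadFinished;
await browser.close();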

talsibony
  • Putting `page.on('response', response => {` inside a `Promise.all` doesn't make sense. `page.on` registers a handler and returns undefined, not a promise. – ggorlen Nov 16 '22 at 16:45

I found a way to wait for the browser to finish downloading a file. The idea is to wait for a response matching a predicate; in my case the URL ends with '/data'.

I just didn't want to load the file contents into a buffer.

await page._client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: download_path,
});

await frame.focus(report_download_selector);
await Promise.all([
    page.waitForResponse(r => r.url().endsWith('/data')),
    page.keyboard.press('Enter'),
]);
Andrey Shorin
  • This worked for me - thanks! Whatever it is about my bank, I couldn't get any of the other approaches to work. No matter how I attempted to intercept the request or make a separate request with the same headers etc, the backend seemed to somehow identify that it hadn't come from their frontend and returned an error page. This works though. – Jay Shark Oct 04 '20 at 11:20

I needed to download a file from behind a login, which was being handled by Puppeteer. targetcreated was not being triggered. In the end I downloaded with request, after copying the cookies over from the Puppeteer instance.

In this case, I'm streaming the file through, but you could just as easily save it.

    const request = require('request');

    res.writeHead(200, {
        "Content-Type": 'application/octet-stream',
        "Content-Disposition": `attachment; filename=secretfile.jpg`
    });
    let cookies = await page.cookies();
    let jar = request.jar();
    for (let cookie of cookies) {
        jar.setCookie(`${cookie.name}=${cookie.value}`, "http://secretsite.com");
    }
    try {
        var response = await request({ url: "http://secretsite.com/secretfile.jpg", jar }).pipe(res);
    } catch (err) {
        console.trace(err);
        return res.send({ status: "error", message: err });
    }
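
If you'd rather save it to disk than stream it through, something like this should work (same hypothetical site as above; the output path is just an example):

    const fs = require('fs');
    // reuse the cookie jar built above and pipe the download straight into a file
    request({ url: "http://secretsite.com/secretfile.jpg", jar })
        .pipe(fs.createWriteStream("/tmp/secretfile.jpg"))
        .on("finish", () => console.log("The file was saved!"));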

One way I found was using the addScriptTag method. It works with headless either false or true.

Using this, any kind of file the page points at can be downloaded. Say the page opens a link like: https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4

Then the mp4 file will be downloaded using the script below:

    await page.addScriptTag({content: `
        function fileName() {
            const link = document.location.href;
            return link.substring(link.lastIndexOf('/') + 1);
        }
        async function save() {
            const bl = await fetch(document.location.href).then(r => r.blob());
            const a = document.createElement("a");
            a.href = URL.createObjectURL(bl);
            a.download = fileName();
            a.hidden = true;
            document.body.appendChild(a);
            a.innerHTML = "download";
            a.click();
        }
        save();
    `});
Dharman

I had a more difficult variation of this, using Puppeteer Sharp. I needed both Headers and Cookies set before the download would start.

In essence, before the button click, I had to process multiple responses and handle a single response with the download. Once I had that particular response, I had to attach headers and cookies for the remote server to send the downloadable data in the response.

await using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true, Product = Product.Chrome }))
await using (var page = await browser.NewPageAsync())
{
    ...
    // Handle multiple responses and process the Download
    page.Response += async (sender, responseCreatedEventArgs) =>
    {
        if (!responseCreatedEventArgs.Response.Headers.ContainsKey("Content-Type"))
            return;

        // Handle the response with the Excel download
        var contentType = responseCreatedEventArgs.Response.Headers["Content-Type"];
        if (contentType.Contains("application/vnd.ms-excel"))
        {
            string getUrl = responseCreatedEventArgs.Response.Url;

            // Add the cookies to a container for the upcoming Download GET request
            var pageCookies = await page.GetCookiesAsync();
            var cookieContainer = BuildCookieContainer(pageCookies);

            await DownloadFileRequiringHeadersAndCookies(getUrl, fullPath, cookieContainer, cancellationToken);
        }
    };

    await page.ClickAsync("button[id^='next']");

    // NEED THIS TIMEOUT TO KEEP THE BROWSER OPEN WHILE THE FILE IS DOWNLOADING!
    await page.WaitForTimeoutAsync(1000 * configs.DownloadDurationEstimateInSeconds);
}

Populate the Cookie Container like this:

private CookieContainer BuildCookieContainer(IEnumerable<CookieParam> cookies)
{
    var cookieContainer = new CookieContainer();
        
    foreach (var cookie in cookies)
    {
        cookieContainer.Add(new Cookie(cookie.Name, cookie.Value, cookie.Path, cookie.Domain));
    }

    return cookieContainer;
}

The details of DownloadFileRequiringHeadersAndCookies are here. If your file-download needs are simpler, you can probably use the other methods mentioned in this thread, or the linked thread.

Cryptc

setDownloadBehavior works fine in headless: true mode and the file is eventually downloaded, but page.goto throws an exception when the download finishes, so for my case a simple wrapper that ignores the error gets the job done:

const fs = require('fs');

function DownloadMgr(page, downloadPath) {
    if (!fs.existsSync(downloadPath)) {
        fs.mkdirSync(downloadPath);
    }
    var init = page.target().createCDPSession().then((client) => {
        return client.send('Page.setDownloadBehavior', {behavior: 'allow', downloadPath: downloadPath});
    });
    this.download = async function(url) {
        await init;
        try {
            // the navigation that triggers the download throws once the file is saved; ignore it
            await page.goto(url);
        } catch (e) {}
        return Promise.resolve();
    }
}

module.exports = DownloadMgr;

var path = require('path');
var DownloadMgr = require('./classes/DownloadMgr');
var downloadMgr = new DownloadMgr(page, path.resolve('./tmp'));
await downloadMgr.download('http://file.csv');
Evgen

I have another solution to this problem, since none of the answers here worked for me.

I needed to log into a website and download some .csv reports. Headed (non-headless) was fine; headless failed no matter what I tried. Looking at the Network errors, the download is aborted, but I couldn't (quickly) determine why.

So I intercepted the requests and used node-fetch to make the request outside of Puppeteer. This required copying the fetch options, body, and headers, and adding in the access cookie.
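
The rough shape of it, as a sketch (the request-matching condition, selector, wait and output path are all placeholders that depend on the site):

const fetch = require('node-fetch');
const fs = require('fs');

// capture the download request the page makes, then replay it outside
// Puppeteer with the session cookies attached
let downloadRequest;
page.on('request', (request) => {
    if (request.url().endsWith('.csv')) {      // placeholder: match your report request
        downloadRequest = request;
    }
});

await page.click('#downloadReport');           // placeholder selector
await page.waitFor(5000);                      // give the request time to fire

const cookies = await page.cookies();
const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join('; ');

const response = await fetch(downloadRequest.url(), {
    method: downloadRequest.method(),
    headers: { ...downloadRequest.headers(), cookie: cookieHeader },
    body: downloadRequest.postData(),
});
fs.writeFileSync('/tmp/report.csv', await response.buffer());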

Good luck.