I am creating a screenshot generator using Puppeteer and Node.js. It works fine for normal web pages, but for PDF pages it gives the same error every time I run it.

Here's the code (the first example from https://github.com/GoogleChrome/puppeteer, with the URL changed to a PDF):

const puppeteer = require('puppeteer');

(async () => {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf');
        await page.screenshot({ path: 'example.png' });
        await browser.close();
    } catch (err) {
        console.log(err);
    }
})();

The error that I get

Error: net::ERR_ABORTED at https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
    at navigate (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\FrameManager.js:121:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  -- ASYNC --
    at Frame.<anonymous> (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\helper.js:110:27)
    at Page.goto (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\Page.js:629:49)
    at Page.<anonymous> (C:\MEAN\puppeteer-demo\node_modules\puppeteer\lib\helper.js:111:23)
    at C:\MEAN\puppeteer-demo\index.js:7:20
    at process._tickCallback (internal/process/next_tick.js:68:7)

Any help is appreciated. I'm also open to any other possible solutions.

Gaurav Saini
  • You won't be able to take a screenshot of a PDF because Chromium creates no target for it. When Chromium loads a PDF, it loads a PDF viewer, which is not a target that the developer tools can debug. – hardkoded May 13 '19 at 15:18

3 Answers

Headless Chrome cannot navigate to a PDF document and throws Error: net::ERR_ABORTED, which is the error you are seeing. You can open a PDF with headless: false, but taking a screenshot still fails, because the PDF is not a regular web page: it is rendered by the PDF viewer in a separate view.

Alternative approach

What you can do instead is download the PDF file and use PDF.js to render an image of a page. You might also want to look for other information on the topic of "pdf to image" or "pdf preview". There are multiple questions on Stack Overflow about that topic, as well as examples on the PDF.js site itself.
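
The download step could be sketched as follows. This is only a minimal sketch, assuming Node.js 18 or later (for the built-in fetch); downloadPdf and isPdfBuffer are hypothetical helper names, not part of Puppeteer or PDF.js. It fetches the file into a Buffer and checks for the "%PDF-" signature before the data is handed to PDF.js. The rendering itself is not shown.

```javascript
// Sketch only: downloadPdf and isPdfBuffer are hypothetical helper names.
// Assumes Node.js 18+ where fetch is available globally.

// A PDF file normally starts with the ASCII signature "%PDF-".
function isPdfBuffer(buffer) {
    return buffer.length >= 5 && buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

// Download a URL into a Buffer and reject anything that does not look like a PDF.
async function downloadPdf(url) {
    const response = await fetch(url);
    if (!response.ok) {
        throw new Error(`Download failed with HTTP status ${response.status}`);
    }
    const buffer = Buffer.from(await response.arrayBuffer());
    if (!isPdfBuffer(buffer)) {
        throw new Error('The downloaded file does not look like a PDF');
    }
    return buffer;
}
```

Checking the signature rather than the file extension also catches the case where a server answers with an HTML error page under a .pdf URL.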

Thomas Dondorf
  • thanks, I was looking for a way to download pdf, but this probably saves me a lot of time. – M4hd1 Aug 13 '19 at 15:56
  • You might also be able to just use PDF.js to do all the work so that you can still do puppeteer stuff in headless mode. You can use both puppeteer and PDF.js in the same script. You can `/\.pdf$/.test( url )` before picking which one to use. I haven't explored PDF.js enough to know all of the capabilities it has as far as downloading and images go, so I won't speak to that, but I've been able to use them in combination to do my own work. – knod Feb 25 '20 at 13:41
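
The extension check from the comment above could be made slightly more robust. The following is a minimal sketch in which isPdfUrl is a hypothetical helper name: it applies the same regular expression to the path part of the URL only, so a query string or fragment after the file name does not defeat the test. It is still only a heuristic, since a URL without a .pdf extension can serve a PDF as well.

```javascript
// Sketch only: isPdfUrl is a hypothetical helper name.
// Tests the path part of the URL so that query strings and fragments
// (for example "test.pdf?download=1") do not defeat the extension check.
function isPdfUrl(url) {
    try {
        const { pathname } = new URL(url);
        return /\.pdf$/i.test(pathname);
    } catch (err) {
        // Not a valid absolute URL; treat it as "not a PDF".
        return false;
    }
}

console.log(isPdfUrl('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')); // true
console.log(isPdfUrl('https://example.com/test.pdf?download=1')); // true
console.log(isPdfUrl('https://example.com/index.html')); // false
```

A script could call this before deciding whether to hand the URL to Puppeteer or to PDF.js.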

For anyone stumbling on this question now: I did it using a combination of Puppeteer, EJS and PDF.js, since Puppeteer by itself cannot display PDF files.

My approach was to use EJS to inject the PDF's URL into an HTML template, let PDF.js render the PDF on that page, and then have Puppeteer take a screenshot of it.

Here's the JS part

const ejs = require('ejs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ 
        args: [
            '--disable-web-security',
            '--disable-features=IsolateOrigins',
            '--disable-site-isolation-trials'
        ]
    });
    const page = await browser.newPage();

    const url = "https://example.com/test.pdf";

    const html = await ejs.renderFile('./template.ejs', { data: { url } });

    await page.setContent(html);
    await page.waitForNetworkIdle();
    const image = await page.screenshot({ encoding: 'base64' });

    await browser.close();

    console.log('Image: ', image);
})();

I added the Chromium arguments to puppeteer.launch() to disable web security, so that the PDF file can be loaded from another origin without CORS errors, as per this answer.

Here's the EJS template

<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <style>
        body {
            width: 100vw;
            height: 100vh;
            margin: 0;
        }
        #page {
            display: flex;
            width: 100%;
            height: 100%;
        }
    </style>

    <title>Document</title>
</head>

<body>
    <canvas id="page"></canvas>
    <script src="https://unpkg.com/pdfjs-dist@2.0.489/build/pdf.min.js"></script>
    <script>
        (async () => {
            const pdf = await pdfjsLib.getDocument('<%= data.url %>');
            const page = await pdf.getPage(1);

            const viewport = page.getViewport(1);
        
            const canvas = document.getElementById('page');
            const context = canvas.getContext('2d');

            canvas.height = viewport.height;
            canvas.width = viewport.width;

            const renderContext = {
                canvasContext: context,
                viewport: viewport
            };

            page.render(renderContext);
        })();
    </script>
</body>

</html>

Do note that this code will take a screenshot of only the first page.
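
To cover more than the first page, the rendering part of the template could loop over every page. The following is only a sketch under the same assumptions as the template (the pdfjs-dist 2.0.x API, where getViewport() takes the scale as a plain number); renderAllPages and createCanvas are hypothetical names that do not come from PDF.js.

```javascript
// Sketch only: renderAllPages and createCanvas are hypothetical names.
// Assumes the pdfjs-dist 2.0.x API used in the template, where
// getViewport() takes the scale as a plain number.
async function renderAllPages(pdf, createCanvas) {
    const canvases = [];
    // PDF.js page numbers start at 1 and run up to pdf.numPages.
    for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
        const page = await pdf.getPage(pageNumber);
        const viewport = page.getViewport(1);

        // One canvas per page, sized to that page.
        const canvas = createCanvas();
        canvas.width = viewport.width;
        canvas.height = viewport.height;

        // Wait for each page to finish drawing before starting the next.
        await page.render({
            canvasContext: canvas.getContext('2d'),
            viewport: viewport
        }).promise;

        canvases.push(canvas);
    }
    return canvases;
}
```

Inside the template this might be called with the loaded document and a callback such as `() => document.body.appendChild(document.createElement('canvas'))`, after removing the fixed 100% sizing from the stylesheet. On the Puppeteer side, `page.screenshot({ fullPage: true })` would then capture the whole stack of pages.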

Chromium cannot open PDF files in headless mode (headless: true). Launch it with headless: false instead:

await puppeteer.launch({ args: ['--no-sandbox'], headless: false });

divyanshu