I'm trying PDF.js.

My problem is that the Hello World demo does not support text selection. It will draw everything in a canvas without the text layer. The official PDF.js demo does support text selection but the code is too complex. I was wondering if somebody has a minimalistic demo with the text layer.

Andre Pena

3 Answers

31

I have committed the example to Mozilla's pdf.js repository and it is available under the examples directory.

The original example that I committed to pdf.js no longer exists, but I believe this example showcases text selection. pdf.js has since been cleaned up and reorganized, and the text-selection logic is now encapsulated inside the text layer, which can be created using a factory.

Specifically, PDFJS.DefaultTextLayerFactory takes care of setting up the basic text-selection stuff.
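
For what it's worth, here is a rough sketch of how that factory can be wired up. It assumes an older pdf.js build whose viewer component (pdf_viewer.js plus pdf_viewer.css) exposes `PDFJS.PDFPageView` and `PDFJS.DefaultTextLayerFactory`, and whose `getDocument` result can be chained with `.then`; later releases renamed and moved these, so check your version. The function name `renderPageWithTextLayer` and its parameters are my own, and `PDFJS` is passed in as an argument only to keep the sketch self-contained.

```javascript
// Sketch only: renders one page with a selectable text layer through the viewer component.
// PDFJS is the pdf.js global; it is a parameter here so the function has no hidden dependencies.
function renderPageWithTextLayer(PDFJS, container, url, pageNumber, scale) {
    return PDFJS.getDocument(url).then(function (pdfDocument) {
        return pdfDocument.getPage(pageNumber);
    }).then(function (pdfPage) {
        var pageView = new PDFJS.PDFPageView({
            container: container,                   // DOM element that receives the page
            id: pageNumber,
            scale: scale,
            defaultViewport: pdfPage.getViewport(scale),
            // The factory creates the text layer that makes selection work.
            textLayerFactory: new PDFJS.DefaultTextLayerFactory()
        });
        pageView.setPdfPage(pdfPage);
        return pageView.draw();
    });
}
```

You would call it with something like `renderPageWithTextLayer(PDFJS, document.getElementById("pdfContainer"), "/path/to/file.pdf", 1, 1.0)`.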


The following example is outdated; only leaving it here for historical reasons.

I have been struggling with this problem for 2-3 days now, but I finally figured it out. Here is a fiddle that shows you how to load a PDF with text-selection enabled.

The difficulty in figuring this out was that the text-selection logic was intertwined with the viewer code (viewer.js, viewer.html, viewer.css). I had to extract the relevant code and CSS to get this to work (the resulting JavaScript file is referenced in the fiddle; you can also check it out here). The end result is a minimal demo that should prove helpful. To implement selection properly, the CSS in viewer.css is also extremely important, as it styles the divs that are eventually created and then used to get text selection working.

The heavy lifting is done by the TextLayerBuilder object, which actually handles the creation of the selection divs. You can see calls to this object from within viewer.js.
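
To give a feel for what creating those selection divs involves, here is a simplified sketch of the positioning math. It is not the actual TextLayerBuilder code: it assumes text items shaped like the ones more recent versions of `getTextContent()` return (a `str` plus a six-element `transform`), it ignores rotation, font ascent and width scaling, and the function names `multiplyTransforms` and `textItemToStyle` are made up for illustration.

```javascript
// Multiply two 2D affine matrices, each given as [a, b, c, d, e, f].
function multiplyTransforms(m1, m2) {
    return [
        m1[0] * m2[0] + m1[2] * m2[1],
        m1[1] * m2[0] + m1[3] * m2[1],
        m1[0] * m2[2] + m1[2] * m2[3],
        m1[1] * m2[2] + m1[3] * m2[3],
        m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
        m1[1] * m2[4] + m1[3] * m2[5] + m1[5]
    ];
}

// Hypothetical helper: turns one text item into CSS values for an absolutely
// positioned, transparent div that sits over the glyphs drawn on the canvas.
function textItemToStyle(item, viewportTransform) {
    // Map PDF coordinates (origin bottom-left) to canvas pixels (origin top-left).
    var tx = multiplyTransforms(viewportTransform, item.transform);
    var fontHeight = Math.sqrt(tx[2] * tx[2] + tx[3] * tx[3]);
    return {
        left: tx[4] + "px",
        top: (tx[5] - fontHeight) + "px", // tx[5] is the baseline; move up by the font height
        fontSize: fontHeight + "px"
    };
}

// Example: 12pt text at PDF position (72, 720) on a 792pt-high page at scale 1.
var style = textItemToStyle(
    { str: "Hello", transform: [12, 0, 0, 12, 72, 720] },
    [1, 0, 0, -1, 0, 792] // assumed viewport transform: flips the y axis
);
console.log(style);
```

The div's text is transparent (see the `.textLayer > div` CSS rule), so what you see is the canvas while what you select is the div.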

Anyway, here's the code including the CSS. Keep in mind that you will still need the pdf.js file. My fiddle has a link to a version that I built from Mozilla's GitHub repo for pdf.js. I didn't want to link to the repo's version directly since they are constantly developing it and it may be broken.

So without further ado:

HTML:

<html>
    <head>
        <title>Minimal pdf.js text-selection demo</title>
    </head>

    <body>
        <div id="pdfContainer" class="pdf-content">
        </div>
    </body>
</html>

CSS:

.pdf-content {
    border: 1px solid #000000;
}

/* CSS classes used by TextLayerBuilder to style the text layer divs */

/* This stuff is important! Otherwise when you select the text, the text in the divs will show up! */
::selection { background:rgba(0,0,255,0.3); }
::-moz-selection { background:rgba(0,0,255,0.3); }

.textLayer {
    position: absolute;
    left: 0;
    top: 0;
    right: 0;
    bottom: 0;
    color: #000;
    font-family: sans-serif;
    overflow: hidden;
}

.textLayer > div {
    color: transparent;
    position: absolute;
    line-height: 1;
    white-space: pre;
    cursor: text;
}

.textLayer .highlight {
    margin: -1px;
    padding: 1px;

    background-color: rgba(180, 0, 170, 0.2);
    border-radius: 4px;
}

.textLayer .highlight.begin {
    border-radius: 4px 0px 0px 4px;
}

.textLayer .highlight.end {
    border-radius: 0px 4px 4px 0px;
}

.textLayer .highlight.middle {
    border-radius: 0px;
}

.textLayer .highlight.selected {
    background-color: rgba(0, 100, 0, 0.2);
}

JavaScript:

//Minimal PDF rendering and text-selection example using pdf.js by Vivin Suresh Paliath (http://vivin.net)
//This fiddle uses a built version of pdf.js that contains all modules that it requires.
//
//For demonstration purposes, the PDF data is not going to be obtained from an outside source. I will be
//storing it in a variable. Mozilla's viewer does support PDF uploads but I haven't really gone through
//that code. There are other ways to upload PDF data. For instance, I have a Spring app that accepts a
//PDF for upload and then communicates the binary data back to the page as base64. I then convert this
//into a Uint8Array manually. I will be demonstrating the same technique here. What matters most here is
//how we render the PDF with text-selection enabled. The source of the PDF is not important; just assume
//that we have the data as base64.
//
//The problem with understanding text selection was that the text selection code was heavily intertwined
//with viewer.html and viewer.js. I have extracted the parts I need out of viewer.js into a separate file
//which contains the bare minimum required to implement text selection. The key component is TextLayerBuilder,
//which is the object that handles the creation of text-selection divs. I have added this code as an external
//resource.
//
//This demo uses a PDF that only has one page. You can render other pages if you wish, but the focus here is
//just to show you how you can render a PDF with text selection. Hence the code only loads up one page.
//
//The CSS used here is also very important since it styles the text layer div overlays that
//you actually end up selecting.
//
//For reference, the actual PDF document that is rendered is available at:
//http://vivin.net/pub/pdfjs/TestDocument.pdf

var pdfBase64 = "..."; //should contain base64 representing the PDF

var scale = 1; //Set this to whatever you want. This is basically the "zoom" factor for the PDF.

/**
 * Converts a base64 string into a Uint8Array
 */
function base64ToUint8Array(base64) {
    var raw = atob(base64); //This is a native function that decodes a base64-encoded string.
    var uint8Array = new Uint8Array(new ArrayBuffer(raw.length));
    for(var i = 0; i < raw.length; i++) {
        uint8Array[i] = raw.charCodeAt(i);
    }

    return uint8Array;
}

function loadPdf(pdfData) {
    PDFJS.disableWorker = true; //Not using web workers. Not disabling results in an error. This line is
                                //missing in the example code for rendering a pdf.

    var pdf = PDFJS.getDocument(pdfData);
    pdf.then(renderPdf);                               
}

function renderPdf(pdf) {
    pdf.getPage(1).then(renderPage);
}

function renderPage(page) {
    var viewport = page.getViewport(scale);
    var $canvas = jQuery("<canvas></canvas>");

    //Set the canvas height and width to the height and width of the viewport
    var canvas = $canvas.get(0);
    var context = canvas.getContext("2d");
    canvas.height = viewport.height;
    canvas.width = viewport.width;

    //Append the canvas to the pdf container div
    jQuery("#pdfContainer").append($canvas);

    //Create the text layer div first, because the HiDPI block below refers to it. It has the
    //same size as the viewport and is positioned directly over the canvas.
    var canvasOffset = $canvas.offset();
    var $textLayerDiv = jQuery("<div />")
        .addClass("textLayer")
        .css("height", viewport.height + "px")
        .css("width", viewport.width + "px")
        .offset({
            top: canvasOffset.top,
            left: canvasOffset.left
        });

    jQuery("#pdfContainer").append($textLayerDiv);

    //The following few lines of code set up scaling on the context if we are on a HiDPI display.
    //getOutputScale and CustomStyle are not part of pdf.js itself; they are expected to come from
    //the code extracted out of viewer.js (the external resource mentioned above).
    var outputScale = getOutputScale();
    if (outputScale.scaled) {
        var cssScale = 'scale(' + (1 / outputScale.sx) + ', ' +
            (1 / outputScale.sy) + ')';
        CustomStyle.setProp('transform', canvas, cssScale);
        CustomStyle.setProp('transformOrigin', canvas, '0% 0%');

        //Apply the same transform to the text layer so that it stays aligned with the canvas
        CustomStyle.setProp('transform', $textLayerDiv.get(0), cssScale);
        CustomStyle.setProp('transformOrigin', $textLayerDiv.get(0), '0% 0%');
    }

    context._scaleX = outputScale.sx;
    context._scaleY = outputScale.sy;
    if (outputScale.scaled) {
        context.scale(outputScale.sx, outputScale.sy);
    }

    page.getTextContent().then(function(textContent) {
        var textLayer = new TextLayerBuilder($textLayerDiv.get(0), 0); //The second zero is an index identifying
                                                                       //the page. It is set to page.number - 1.
        textLayer.setTextContent(textContent);

        var renderContext = {
            canvasContext: context,
            viewport: viewport,
            textLayer: textLayer
        };

        page.render(renderContext);
    });
}

var pdfData = base64ToUint8Array(pdfBase64);
loadPdf(pdfData);    
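
If nothing renders, one thing worth ruling out is bad base64 data. Below is a small sanity check, written as a sketch: `decodeBase64` does the same job as `base64ToUint8Array` in a more compact form, and `looksLikePdf` is a made-up helper that only checks for the `%PDF-` signature at the very start of the data (some PDFs that other readers tolerate may not pass such a strict check).

```javascript
// Decode a base64 string into a Uint8Array (compact variant of base64ToUint8Array).
function decodeBase64(base64) {
    return Uint8Array.from(atob(base64), function (ch) {
        return ch.charCodeAt(0);
    });
}

// Hypothetical helper: true if the bytes begin with the "%PDF-" file signature.
function looksLikePdf(bytes) {
    var signature = "%PDF-";
    if (bytes.length < signature.length) {
        return false;
    }
    for (var i = 0; i < signature.length; i++) {
        if (bytes[i] !== signature.charCodeAt(i)) {
            return false;
        }
    }
    return true;
}

// "JVBERi0xLjQ=" is the base64 encoding of the text "%PDF-1.4".
console.log(looksLikePdf(decodeBase64("JVBERi0xLjQ="))); // true
console.log(looksLikePdf(decodeBase64("SGVsbG8=")));     // false ("Hello")
```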
Vivin Paliath
  • Wow.. you did a great job! It will surely help many others. The difficulties I had were exactly the ones you described. Thank you for putting this demo together. – Andre Pena Jun 07 '13 at 13:33
  • Your fiddle works fine, but when I copy it to an HTML file, it does not work :( What other files should I include? – Damjan Pavlica Jan 09 '15 at 14:41
  • @DamjanPavlica Take a look at the console to see what it's complaining about. You need to have pdf.js available. – Vivin Paliath Jan 09 '15 at 14:51
  • First it was missing pdf.js, so I included it. Then it complained about jQuery, so I included it. Now I get "ReferenceError: getOutputScale is not defined". – Damjan Pavlica Jan 09 '15 at 15:06
  • Can you please tell me why this doesn't work? I have followed your example, there's no problem in the console, but nothing on the screen: http://znaci.net/damjan/probe/pdf/proba.html – Damjan Pavlica Jan 16 '15 at 20:58
  • @DamjanPavlica Which version of pdf.js are you using? From my understanding, the latest version has some changes which changed the signature of `TextLayerBuilder`. – Vivin Paliath Jan 20 '15 at 15:40
  • Thanks for the answer, but it is ok now, I succeed to create simple example. – Damjan Pavlica Jan 21 '15 at 08:45
  • @VivinPaliath, thanks for the fiddle. Is it possible to give a PDF path instead of converting it to `base64`? Also, can we provide `prev/next` options with the same? – Slimshadddyyy Jan 23 '15 at 11:14
  • i would also be interested in giving a path instead of base64. is this possible? if not, how can i convert it? – vtni Feb 13 '15 at 01:57
  • @Slimshadddyyy, vtni yes, it is possible. I'd take a look at the documentation for pdf.js. I don't remember off the top of my head how to do it, unfortunately. – Vivin Paliath Feb 13 '15 at 16:01
  • If you want to use a path just call `PDFJS.getDocument('/path/to/my/doc')` inside the loadPdf() function, instead of passing in the binary data. – The Unknown Dev Jan 27 '16 at 20:08
  • @Jamil When I replace pdfBase64 for pdf file, the text layer stops working. – Damjan Pavlica Jul 30 '16 at 13:17
  • @DamjanPavlica If you are still calling `base64ToUint8Array` with `pdfBase64` then it won't work since it is not in base64 anymore. Try just `loadPdf("/path/to/my/file.pdf");` – The Unknown Dev Jul 30 '16 at 15:17
  • I am not calling base64ToUint8Array function, the PDF file gets opened fine, but text selection not working anymore... No errors in the console. – Damjan Pavlica Jul 30 '16 at 17:27
  • Can someone please provide working text selection example with a real PDF file? – Damjan Pavlica Jul 30 '16 at 17:34
  • My bad. It seems that the PDF file was non-editable; the script is good. – Damjan Pavlica Jul 30 '16 at 19:03
  • What is CustomStyle? – VIKAS KOHLI Oct 16 '20 at 09:01
4

Because this is an old question with an old accepted answer, you may need this solution to get it working with more recent PDF.js versions:

http://www.ryzhak.com/converting-pdf-file-to-html-canvas-with-text-selection-using-pdf-js

Here is the code they used. Include the following CSS and scripts from the PDF.js code:

<link rel="stylesheet" href="pdf.js/web/text_layer_builder.css" />
<script src="pdf.js/web/ui_utils.js"></script>
<script src="pdf.js/web/text_layer_builder.js"></script>

Use this code to load the PDF:

PDFJS.getDocument("oasis.pdf").then(function(pdf){
    var page_num = 1;
    pdf.getPage(page_num).then(function(page){
        var scale = 1.5;
        var viewport = page.getViewport(scale);
        var canvas = $('#the-canvas')[0];
        var context = canvas.getContext('2d');
        canvas.height = viewport.height;
        canvas.width = viewport.width;

        var canvasOffset = $(canvas).offset();
        var $textLayerDiv = $('#text-layer').css({
            height : viewport.height+'px',
            width : viewport.width+'px',
            top : canvasOffset.top,
            left : canvasOffset.left
        });

        page.render({
            canvasContext : context,
            viewport : viewport
        });

        page.getTextContent().then(function(textContent){
            console.log(textContent);
            var textLayer = new TextLayerBuilder({
                textLayerDiv : $textLayerDiv.get(0),
                pageIndex : page_num - 1,
                viewport : viewport
            });

            textLayer.setTextContent(textContent);
            textLayer.render();
        });
    });
});    
Mosta
  • See also examples https://github.com/mozilla/pdf.js/tree/master/examples/components – async5 Dec 31 '15 at 18:13
  • 3 years later, linked solution doesn't work. Getting same errors as others reported. pdf displays but no text selection. "TextLayerBuilder is not defined" and two console errors: "unexpected token export"(http://localhost:8080/pdf.js/web/ui_utils.js:879) and "unexpected token" { (http://localhost:8080/pdf.js/web/text_layer_builder.js:16) Just once I'd like to find examples that actually work. – user3217883 Mar 27 '19 at 19:12
  • @user3217883 I created this fiddle for you: https://jsfiddle.net/kingliam/7y13trao/. This text-selection example is working as of April 2019 – kingliam Apr 11 '19 at 02:01
  • as of August 2019 : Failed to load resource: net::ERR_BLOCKED_BY_CLIENT – Chris Tarasovs Aug 28 '19 at 20:51
1

If you want to render all pages of a PDF document, each as a separate page, with text selection, you can use either

  1. the PDF viewer, or
  2. a canvas plus a renderer that parses the text and appends it over the canvas, so that the text looks selectable.

In a real scenario, though, canvas operations such as zooming in and out will severely reduce browser performance. Please check the URL below:

http://learnnewhere.unaux.com/pdfViewer/viewer.html

You can get the complete code here: https://github.com/learnnewhere/simpleChatApp/tree/master/pdfViewer
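
As a rough illustration of the second option (this is not the code from the repository above), rendering every page comes down to walking from page 1 to `pdf.numPages` and handing each page to a per-page render function. The sketch assumes `pdf` is the document object that pdf.js resolves from `getDocument`, and `renderPage` is whatever function draws one page's canvas and text layer; `renderAllPages` is a name I made up.

```javascript
// Sketch: render pages strictly one after another so they are appended in order.
function renderAllPages(pdf, renderPage) {
    var chain = Promise.resolve();
    for (var n = 1; n <= pdf.numPages; n++) {
        // Wrap in a function so each step keeps its own page number.
        (function (pageNumber) {
            chain = chain.then(function () {
                return pdf.getPage(pageNumber);
            }).then(renderPage);
        })(n);
    }
    return chain; // resolves once the last page has been handed to renderPage
}
```

Rendering sequentially keeps the pages in document order; for long documents you would probably render only the pages that are currently visible, which is what the full viewer does.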

learn here