I'm not trying to modify the PDF; I'm just trying to change the displayed text.

pdf.js outputs the text it reads into a bunch of divs (.textLayer > div), and it also draws the page on a canvas.

I read here that viewing and editing pdf in the browser is almost impossible, but...

Since pdf.js does have an API, my idea is to "hook" into pdf.js and change the displayed text (that's more than enough in my case)

The closest thing I could find is the function getTextContent(), but there is no callback registered for it AFAICS.

Is this even possible (without messing with pdf.js itself)? If so, how?


EDIT (3)

This code prints the PDF text to the console, but how to proceed from there is a mystery to me.

'use strict';

// In production, the bundled pdf.js shall be used instead of SystemJS.
Promise.all([System.import('pdfjs/display/api'),
System.import('pdfjs/display/global'),
System.import('pdfjs/display/network'),
System.resolve('pdfjs/worker_loader')])
    .then(function (modules)
    {
        var api = modules[0], global = modules[1];

        // In production, change this to point to the built `pdf.worker.js` file.
        global.PDFJS.workerSrc = modules[3];

        // Fetch the PDF document from the URL using promises
        let loadingTask        = api.getDocument('cv.pdf');

        loadingTask.onProgress = function (progressData) {
            document.getElementById('progress').innerText = (progressData.loaded / progressData.total);
        };

        loadingTask.then(function (pdf)
        {
            // Fetch the page.
            pdf.getPage(1).then(function (page)
            {
                var scale     = 1.5;
                var viewport  = page.getViewport(scale);

                // Prepare canvas using PDF page dimensions.
                var canvas    = document.getElementById('pdf-canvas');
                var context   = canvas.getContext('2d');
                canvas.height = viewport.height;
                canvas.width  = viewport.width;

                // (Debug) Get PDF text content
                page.getTextContent().then(function (textContent)
                {
                    console.log(textContent);
                });

                // Render PDF page into canvas context.
                var renderContext =
                {
                    canvasContext: context,
                    viewport     : viewport
                };
                page.render(renderContext);
            });
        });
    });

EDIT (2)

The code example that I'm trying to mess with is viewer.js. Granted, it's not the easiest example, but it's the simplest one I could find that renders the text in the DOM.


EDIT (1)

I did try to manipulate the DOM (specifically the .textLayer > div elements I mentioned earlier), but pdf.js uses both divs and a canvas to do its magic. The text is not only in the DOM, so the result was the text div shown on top of the canvas text (or the other way around), see:

https://i.stack.imgur.com/JvEUN.jpg

Neels
TheDude
  • PDF.js "converts" pdf to html, if the text is indeed text and not an image of text then you should be able to manipulate the html directly – Jaromanda X Aug 15 '17 at 05:28
  • @JaromandaX I edited my post, but I got stuck with canvas tag, THAT would be awesome if I could achieve text manipulation using just the DOM – TheDude Aug 15 '17 at 06:22
  • I think it can be done. It has a promise that resolves after the document finishes loading, and it returns the document itself. Can you update your question and provide a complete example of how you are using it right now? – Christos Lytras Aug 17 '17 at 15:34
  • @ChristosLytras: my apologies for the delay, I edited my post to point to code example (I tried to come up with my own example, but I miserably failed :() – TheDude Aug 20 '17 at 13:37
  • If I got you correctly, you want to modify the text that is in the PDF file. I would get the SVG version of the template you want to edit, change the text in the SVG and then convert that SVG to PDF. pdf.js is used only for viewing PDF files, not for editing their content. – Muhamed Krlić Aug 24 '17 at 09:42

2 Answers


The effect described in your first edit occurs because pdf.js draws the text on the canvas and also places hidden div elements on top of it to enable text selection, so changing the divs leaves the canvas text untouched. To prevent pdf.js from rendering text on the canvas without modifying the script, you can add the following code:

CanvasRenderingContext2D.prototype.strokeText = function () { };
CanvasRenderingContext2D.prototype.fillText = function () { };
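The two overrides above are permanent and affect every canvas on the page. A minimal sketch of a reversible variant, assuming you want text drawing back once the PDF page has been rendered; `suppressCanvasText` is a made-up helper name, not part of pdf.js:

```javascript
'use strict';

// Hypothetical helper (not a pdf.js API): replaces the text-drawing methods
// on a 2D-context prototype with no-ops and returns a function that puts
// the original methods back.
function suppressCanvasText(proto) {
    var original = {
        fillText  : proto.fillText,
        strokeText: proto.strokeText
    };

    proto.fillText   = function () { };
    proto.strokeText = function () { };

    return function restore() {
        proto.fillText   = original.fillText;
        proto.strokeText = original.strokeText;
    };
}

// Browser-only part: guarded so the snippet also loads outside a browser.
if (typeof CanvasRenderingContext2D !== 'undefined') {
    var restoreCanvasText = suppressCanvasText(CanvasRenderingContext2D.prototype);
    // Call restoreCanvasText() only after the page has finished rendering,
    // since rendering is asynchronous.
}
```

Restoring too early would let pdf.js draw the text again, so the restore call belongs after the render has completed.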

Also, if you want to avoid manipulating the text in the HTML elements, you can render the text yourself from the same getTextContent() result that you print to the console. Here is a working jsfiddle that changes Hello, world! to Burp! :)
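The replacement step on its own can be sketched roughly as follows. This is not the jsfiddle code; it assumes the object resolved by getTextContent() has an `items` array whose entries carry the text in a `str` property, and `rewriteTextItems` is a name invented for this example:

```javascript
'use strict';

// Hypothetical helper (not a pdf.js API): returns a copy of a
// getTextContent()-style result in which every item's `str` has been passed
// through `rewrite`. The input object and its items are left unmodified.
function rewriteTextItems(textContent, rewrite) {
    var items = textContent.items.map(function (item) {
        var copy = Object.assign({}, item);
        // Defensive check: only touch entries that actually carry a string.
        if (typeof copy.str === 'string') {
            copy.str = rewrite(copy.str);
        }
        return copy;
    });

    return Object.assign({}, textContent, { items: items });
}

// Stand-in data shaped like a text content result, for illustration only.
var sample = {
    items : [{ str: 'Hello, world!', transform: [1, 0, 0, 1, 10, 20] }],
    styles: {}
};

var changed = rewriteTextItems(sample, function (str) {
    return str.replace('Hello, world!', 'Burp!');
});

console.log(changed.items[0].str); // prints "Burp!"
```

The rewritten items keep their position data (`transform`), so whatever draws them can place the new text where the old text was. A replacement of a different length will not fit the original width exactly.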

The jsfiddle was created from the following resources:

vl4d1m1r4
  • Thank you for the jsfiddle...at first glance this does what I want, but I need to test it on a more "realistic" pdfs (I expect 99% of the PDFs that I'll deal with will be mixed - text + images), I'll get back to you asap! – TheDude Aug 22 '17 at 21:53
  • I'm still looking at the code you posted. There are 2 issues with it: it uses PDFJS, which is [marked as deprecated](https://github.com/mozilla/pdf.js/blob/cb10c03d0a83331fa3e8a0ad8c5162fc1a357f56/src/display/global.js#L39), and it doesn't reproduce the correct color, but other than that it's brilliant. – TheDude Aug 23 '17 at 12:41
  • Putting the deprecation issue aside, it would be awesome to be able to fix the color issue (which is more urgent than the other one) – TheDude Aug 23 '17 at 12:43
  • This doesn't seem to work anymore, can we get an update? – pguardiario Apr 27 '19 at 03:36

You can add extra code to pdf.js itself.

getTextContent: function PDFPageProxy_getTextContent(params) {
      return this.transport.messageHandler.sendWithPromise('GetTextContent', {
        pageIndex: this.pageNumber - 1,
        normalizeWhitespace: params && params.normalizeWhitespace === true ? true : false,
        combineTextItems: params && params.disableCombineTextItems === true ? false : true
      });
    }

In the code above, you can add a console.log to check whether getTextContent is being called, and add whatever extra content you want.
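If you would rather not edit pdf.js, the same idea can be tried from the outside by wrapping the method on the page object. A minimal sketch, assuming the method returns a promise as in the snippet above; `hookAsyncMethod` and `fakePage` are invented for this example, and whether the viewer's own text layer goes through getTextContent() depends on the pdf.js version, so treat it as something to verify:

```javascript
'use strict';

// Hypothetical helper (not a pdf.js API): wraps obj[methodName] so that the
// value its promise resolves with is passed through `transform` before any
// caller sees it. Returns a function that removes the hook again.
function hookAsyncMethod(obj, methodName, transform) {
    var original = obj[methodName];

    obj[methodName] = function () {
        var result = original.apply(this, arguments);
        return Promise.resolve(result).then(transform);
    };

    return function unhook() {
        obj[methodName] = original;
    };
}

// Stand-in for a pdf.js page object, for illustration only.
var fakePage = {
    getTextContent: function () {
        return Promise.resolve({ items: [{ str: 'Hello, world!' }] });
    }
};

// Log every call and change the text before the caller receives it.
hookAsyncMethod(fakePage, 'getTextContent', function (textContent) {
    console.log('getTextContent resolved with', textContent.items.length, 'item(s)');
    textContent.items.forEach(function (item) {
        item.str = item.str.replace('Hello, world!', 'Burp!');
    });
    return textContent;
});
```

With a real page, the object passed in would be the one received from pdf.getPage(). The hook only affects callers that use that same page object after the hook is installed.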

artgb
  • Thank you, but I'm not sure how to put that to use; the code you posted looks very similar to `streamTextContent()` (the function just above `getTextContent()` in `api.js`). I edited my post to add some code. I'd appreciate it if you could point me in some direction, thanks! – TheDude Aug 20 '17 at 13:43
  • Would this URL help you? https://stackoverflow.com/questions/10273309/need-to-hook-into-a-javascript-function-call-any-way-to-do-this – artgb Aug 21 '17 at 00:01