How do I replace a string in a PDF file using NodeJS?

Question

I have a template PDF file, and I want to replace some marker strings to generate new PDF files and save them. What's the best/simplest way to do this? I don't need to add graphics or anything fancy, just a simple text replacement, so I don't want anything too complicated.

Thanks!

Edit: Just found HummusJS, I'll see if I can make progress and post it here.

Hi there Manuel! Did you find a solution? – Fredefl Nov 17 '16 at 21:44
I have the same situation, did you find a solution? – M.Abulsoud Sep 30 '19 at 12:07

score 12 · Answer 1 · edited Dec 16 '20 at 18:40

I found this question by searching, so I think it deserves an answer. I found one by BrighTide here: https://github.com/galkahana/HummusJS/issues/71#issuecomment-275956347

Basically, there is a very powerful package called Hummus, which wraps a cross-platform library written in C++. I think the answer given in that GitHub comment can be turned into a function like this:

var hummus = require('hummus');

/**
 * Converts a string to a plain array of byte values,
 * which is what the hummus write stream is given below.
 *
 * @param {string} str - input string
 * @returns {number[]} the bytes of the UTF-8 encoded string
 */
function strToByteArray(str) {
  // Buffer.from replaces the deprecated `new Buffer(...)` constructor
  return Array.from(Buffer.from(str));
}

function replaceText(sourceFile, targetFile, pageNumber, findText, replaceText) {
    var writer = hummus.createWriterToModify(sourceFile, {
        modifiedFilePath: targetFile
    });
    var sourceParser = writer.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
    var pageObject = sourceParser.parsePage(pageNumber); // pageNumber is zero-based
    // This expects the page's Contents entry to be a single indirect reference;
    // the PDF format also allows an array of streams, which is not handled here
    var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();
    var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');
    //read the original block of text data
    var data = [];
    var readStream = sourceParser.startReadingFromStream(textStream);
    while(readStream.notEnded()){
        Array.prototype.push.apply(data, readStream.read(10000));
    }
    var string = Buffer.from(data).toString().replace(findText, replaceText);

    //Create and write our new text object
    var objectsContext = writer.getObjectsContext();
    objectsContext.startModifiedIndirectObject(textObjectId);

    var stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(string));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();

    writer.end();
}

// replaceText('source.pdf', 'output.pdf', 0, /REPLACEME/g, 'My New Custom Text');
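A note on the findText argument: it is passed straight to String.prototype.replace, so a plain string only replaces the first occurrence, while a regex with the g flag replaces all of them. Here is a minimal sketch of turning a literal marker into a global regex. The helper names (escapeRegExp, markerToGlobalRegExp) and the {{NAME}} marker are made up for illustration and are not part of Hummus.

```javascript
// Hypothetical helper: escape regex metacharacters so a literal marker
// such as "{{NAME}}" can be matched as-is.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a global regex that matches every occurrence of the marker.
function markerToGlobalRegExp(marker) {
  return new RegExp(escapeRegExp(marker), 'g');
}

// A made-up fragment of page content with the same marker twice.
const content = '(Hello {{NAME}}) Tj (Bye {{NAME}}) Tj';

// A plain string pattern replaces the first occurrence only.
const firstOnly = content.replace('{{NAME}}', 'Ann');

// A global regex replaces every occurrence.
const everywhere = content.replace(markerToGlobalRegExp('{{NAME}}'), 'Ann');

console.log(firstOnly);  // (Hello Ann) Tj (Bye {{NAME}}) Tj
console.log(everywhere); // (Hello Ann) Tj (Bye Ann) Tj
```

The result of markerToGlobalRegExp could then be passed as findText. Keep in mind that the replacement text ends up inside a PDF string, so parentheses or backslashes in it would need escaping too.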

UPDATE:
The version used at the time of writing this example was 1.0.83; things may have changed since.

UPDATE 2: Recently I ran into an issue with another PDF file that used a different font. For some reason the text was split into small chunks, i.e. the string QWERTYUIOPASDFGHJKLZXCVBNM1234567890- was represented as -286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-). I had no better idea than to build a regex. So instead of this one line:

var string = Buffer.from(data).toString().replace(findText, replaceText);

I have something like this now:

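var string = Buffer.from(data).toString();

// REPLACE_ME and REPLACE_WITH_THIS are placeholders for your own strings
var characters = REPLACE_ME;
var match = [];
for (var a = 0; a < characters.length; a++) {
    // each character may be preceded by a spacing number and wrapped in parentheses
    // (characters that are special in a regex, such as '.', would need escaping first)
    match.push('(-?[0-9]+)?(\\()?' + characters[a] + '(\\))?');
}

string = string.replace(new RegExp(match.join('')), function(m, m1) {
    // m1 is the spacing number captured before the first character,
    // or undefined if there was none
    return (m1 || '') + '( ' + REPLACE_WITH_THIS + ')';
});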
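To check whether a marker is present at all when the text is chunked like this, it can help to reassemble the visible text first. Below is a rough sketch that concatenates the parenthesized chunks and drops the numbers between them. The function name textFromTjOperand is made up, and it assumes simple literal strings: nested unescaped parentheses and octal escapes are not handled.

```javascript
// Sketch (hypothetical helper): recover the visible text from a chunked
// operand such as -286(Q)9(WER)24(T)...
function textFromTjOperand(operand) {
  // Match each "(...)" chunk; "\\." lets escaped characters such as "\)" stay inside a chunk.
  const chunkPattern = /\(((?:\\.|[^\\)])*)\)/g;
  let text = '';
  for (const match of operand.matchAll(chunkPattern)) {
    // Undo the backslash escapes for "(", ")" and "\"; other escapes are left untouched.
    text += match[1].replace(/\\([()\\])/g, '$1');
  }
  return text;
}

const operand = '-286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-)';
const visible = textFromTjOperand(operand);

console.log(visible); // QWERTYUIOPASDFGHJKLZXCVBNM1234567890-
```

This only reads the text. Writing a replacement back still needs something like the regex above, because the chunk boundaries have to be matched in the raw content.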

answered Jun 28 '17 at 09:55 by Alex K · edited Dec 16 '20 at 18:40 by miguelmorin

what is the `strToByteArray`? – Alexey Sh. May 31 '18 at 22:16
@AlexeySh. i've attached it to example, it was in the original github comment – Alex K Jun 01 '18 at 07:27
@AlexeySh. if strings are vectorised in your PDF, it obviously won't work – Alex K Jul 13 '18 at 07:40
I am getting the following error: TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function – Nithin Nov 28 '18 at 07:56
@Nithin the version used was 1.0.83, maybe something changed... but did you try with the simplest pdf file first? is the text selectable when you open the pdf? – Alex K Nov 28 '18 at 09:45
Hey Alex, I am trying with a simple PDF file only. The text itself isn't getting selected. On running i get the following error before that itself. – Nithin Nov 28 '18 at 10:20
@Nithin that means the text is vectorized, you cannot replace it as it's being presented as vectors – Alex K Nov 28 '18 at 10:27
I am trying to work on the same PDF using hummus-recipe and it's working but when i try with hummus, its failing. – Nithin Nov 28 '18 at 10:50
@Nithin I would inspect what `pageObject.getDictionary().toJSObject()` returns, without trying to guess – Alex K Nov 28 '18 at 14:19
@AlexK, I installed 1.0.83 and it's not throwing any error, also it's not replacing the text. on inspecting i am getting these. { Contents: PDFIndirectObjectReference {}, MediaBox: PDFArray {}, Parent: PDFIndirectObjectReference {}, Resources: PDFIndirectObjectReference {}, Type: PDFName { value: 'Page' } } – Nithin Nov 28 '18 at 17:38
I still have the same issue. Has anyone found a solution? – M.Abulsoud Sep 30 '19 at 12:15
@Nithin Where did you reach with this? – M.Abulsoud Oct 01 '19 at 15:06
It works on a simple PDF file, but on complex ones I still have the same issue. Has anyone found a solution? – M.Abulsoud Oct 01 '19 at 16:46
@M.Abulsoud there are differences in how the text might be embedded in the PDF. Sometimes it might even be vectorized... I still find this solution quite hacky and you need to make sure you control the PDF source – Alex K Oct 01 '19 at 17:02
@AlexK It really seems like a simple problem at the beginning, but it's a really complex one. I'm putting placeholders on my form templates using the pdfotter template editor, and want to replace those placeholders with real data on my side. – M.Abulsoud Oct 01 '19 at 17:19
@M.Abulsoud check my latest update 2, it might be related – Alex K Oct 01 '19 at 17:27
@AlexK Thank you, my issue is different I got this error which most of users got: TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function – M.Abulsoud Oct 01 '19 at 21:37
Hey, so this solution worked, but not as expected; it was more of a hack. What I did instead was create a simple page using ejs templates and have puppeteer make a PDF of that. It was clean, elegant and scalable. – Nithin Oct 06 '19 at 13:14

Syas · Answer 2 · 2022-04-15T01:08:23.280

Building on Alex's (and others') solution, I noticed an issue where some non-text data was becoming corrupted. I tracked this down to encoding/decoding the PDF text as utf-8 instead of as a binary string. Anyway, here's a modified solution that:

Avoids corrupting non-text data
Uses streams instead of files
Allows multiple patterns/replacements
Uses the MuhammaraJS package which is a maintained fork of HummusJS (should be able to swap in HummusJS just fine as well)
Is written in TypeScript (feel free to remove the types for JS)

import muhammara from "muhammara";

interface Pattern {
  searchValue: RegExp | string;
  replaceValue: string;
}

/**
 * Modify a PDF by replacing text in it
 */
const modifyPdf = ({
  sourceStream,
  targetStream,
  patterns,
}: {
  sourceStream: muhammara.ReadStream;
  targetStream: muhammara.WriteStream;
  patterns: Pattern[];
}): void => {
  const modPdfWriter = muhammara.createWriterToModify(sourceStream, targetStream, { compress: false });
  const numPages = modPdfWriter
    .createPDFCopyingContextForModifiedFile()
    .getSourceDocumentParser()
    .getPagesCount();

  for (let page = 0; page < numPages; page++) {
    const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
    const objectsContext = modPdfWriter.getObjectsContext();

    const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
    const textStream = copyingContext
      .getSourceDocumentParser()
      .queryDictionaryObject(pageObject.getDictionary(), "Contents");
    const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();

    let data: number[] = [];
    const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
    while (readStream.notEnded()) {
      const readData = readStream.read(10000);
      data = data.concat(readData);
    }

    const pdfPageAsString = Buffer.from(data).toString("binary"); // key change 1

    let modifiedPdfPageAsString = pdfPageAsString;
    for (const pattern of patterns) {
      // Note: replaceAll throws a TypeError if searchValue is a RegExp without the "g" flag
      modifiedPdfPageAsString = modifiedPdfPageAsString.replaceAll(pattern.searchValue, pattern.replaceValue);
    }

    // Create what will become our new text object
    objectsContext.startModifiedIndirectObject(textObjectID);

    const stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(modifiedPdfPageAsString));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();
  }

  modPdfWriter.end();
};

/**
 * Create a byte array from a string, as muhammara expects
 */
const strToByteArray = (str: string): number[] => {
  const myBuffer = [];
  const buffer = Buffer.from(str, "binary"); // key change 2
  for (let i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
  }
  return myBuffer;
};

And then to use it:

/**
 * Fill a PDF with template data
 */
export const fillPdf = async (sourceBuffer: Buffer): Promise<Buffer> => {
  const sourceStream = new muhammara.PDFRStreamForBuffer(sourceBuffer);
  const targetStream = new muhammara.PDFWStreamForBuffer();

  modifyPdf({
    sourceStream,
    targetStream,
    patterns: [{ searchValue: "home", replaceValue: "emoh" }], // TODO use actual patterns
  });

  return targetStream.buffer;
};
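The two "key change" comments are about the string encoding used for the round trip. As a small standalone illustration (plain Node.js Buffer calls, no PDF library involved, with made-up sample bytes), decoding arbitrary bytes as utf-8 and encoding them again does not give the original bytes back, while "binary" (latin1) maps every byte to one character and back:

```javascript
// Five sample bytes: "(", three bytes that are not valid UTF-8 on their own, then ")".
const original = Buffer.from([0x28, 0x80, 0xff, 0xe9, 0x29]);

// Round trip through a UTF-8 string: invalid sequences become U+FFFD,
// which is written back as three bytes each, so the data changes.
const viaUtf8 = Buffer.from(original.toString('utf8'), 'utf8');

// Round trip through a "binary" (latin1) string: one character per byte, nothing is lost.
const viaBinary = Buffer.from(original.toString('binary'), 'binary');

console.log(viaUtf8.equals(original));   // false
console.log(viaBinary.equals(original)); // true
```

This is why a content stream that contains non-ASCII bytes (inline images, some font encodings) can come out damaged after a utf-8 round trip even when the search string never matches.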

I'm getting this kind of result when trying to convert to a string: "Tm [<0003000400050006000700080006>] TJ". I tried with utf-8 and with binary --> Buffer.from(data).toString('utf-8'); – Beni Gazala Dec 04 '22 at 15:58
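A string written between angle brackets like that is a PDF hex string, so the text is stored as character codes rather than readable letters, and a plain-text search cannot match it. As a rough sketch, assuming the font uses 2-byte codes (common for composite fonts, but that depends on the font's encoding), the codes could be pulled out like this. The function name hexStringToCodes is made up, and mapping the codes to actual characters would additionally need the font's ToUnicode table, which is not shown here.

```javascript
// Sketch (hypothetical helper): split a PDF hex string into 16-bit codes.
// ASSUMPTION: every code is 2 bytes (4 hex digits); this depends on the font.
function hexStringToCodes(hexString) {
  // Drop the angle brackets and any whitespace inside them.
  const hex = hexString.replace(/[<>\s]/g, '');
  const codes = [];
  for (let i = 0; i + 4 <= hex.length; i += 4) {
    codes.push(parseInt(hex.slice(i, i + 4), 16));
  }
  return codes;
}

const codes = hexStringToCodes('<0003000400050006000700080006>');

console.log(codes); // [ 3, 4, 5, 6, 7, 8, 6 ]
```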

Tilal Ahmad · Answer 3 · 2020-08-17T12:06:52.760

There is another Node.js package, asposepdfcloud (the Aspose.PDF Cloud SDK for Node.js). You can use it to replace text in your PDF document conveniently. Its free plan offers 150 credits monthly. Here is sample code to replace text in a PDF document; don't forget to install asposepdfcloud first.

const { PdfApi } = require("asposepdfcloud");
const { TextReplaceListRequest } = require("asposepdfcloud/src/models/textReplaceListRequest");
const { TextReplace } = require("asposepdfcloud/src/models/textReplace");

// Get App key and App SID from https://aspose.cloud
const pdfApi = new PdfApi("xxxxx-xxxxx-xxxx-xxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxb");

const fs = require('fs'); // only needed for the commented-out upload step

const name = "02_pages.pdf";
const remoteTempFolder = "Temp";
//const localTestDataFolder = "C:\\Temp";
//const path = remoteTempFolder + "\\" + name;
//const data = fs.readFileSync(localTestDataFolder + "\\" + name);

// First replacement: "origami" -> "aspose"
const textReplace = new TextReplace();
textReplace.oldValue = "origami";
textReplace.newValue = "aspose";
textReplace.regex = false;

// Second replacement: "candy" -> "biscuit"
const textReplace1 = new TextReplace();
textReplace1.oldValue = "candy";
textReplace1.newValue = "biscuit";
textReplace1.regex = false;

const trr = new TextReplaceListRequest();
trr.textReplaces = [textReplace, textReplace1];

// Upload File
//pdfApi.uploadFile(path, data).then((result) => {
//    console.log("Uploaded File");
//}).catch(function(err) {
//    // Deal with an error
//    console.log(err);
//});

// Replace text
pdfApi.postDocumentTextReplace(name, trr, null, remoteTempFolder).then((result) => {
    console.log(result.body.code);
}).catch(function(err) {
    // Deal with an error
    console.log(err);
});

P.S.: I'm a developer evangelist at Aspose.
