Error replacing text in PDF with NodeJS and hummus

Question

The question "How do I replace a string in a PDF file using NodeJS?" has a solution for replacing text in a PDF. Using the same code, I have a puzzling issue: the text is replaced in the source of the PDF, but not all of it renders. The relevant lines, adapted from that solution, are:

  console.log(replaceText);
  var string = Buffer.from(data).toString().replace(findText, replaceText);
  console.log(string);

The console shows that it is replaced in the source string of the PDF:

/TT0 1 Tf
65.5689 -24.5097 24.5097 65.5689 363.9941 762.3682 Tm
(e)Tj
61.1539 -34.0617 34.0617 61.1539 381.1689 756.6411 Tm
(n)Tj
54.8214 -43.5272 43.5272 54.8214 408.6333 741.0947 Tm
(d)Tj
48.8331 -50.153 50.153 48.8331 426.3779 726.999 Tm
(a)Tj
52 0 0 52 75.8203 226.9756 Tm
(abcdefghijklmnopqrstuvwxyz)Tj
33 0 0 33 25.8203 302.9756 Tm
(E)Tj
(ste cheque-prenda, para:)Tj
1.818 -7.152 Td
(www.emocoes.org/abcdefghijklmnopqrstuvwxyz)Tj
ET

and the PDF looks like this:

The K, X, and Y are missing in this case. Opening the file in Adobe Illustrator shows they are still there behind other letters:

I could not find a definite pattern: sometimes H and J are also missing with other replacement strings, and the missing letters are different with other fonts (I tested Open Sans and Times New Roman).

What is the problem, and how can I fix it?

My code is:

var path = require("path");
var hummus = require("hummus");

function customizeVoucher(findText, replaceText) {
  var sourceFile = path.join(__dirname, "../private/vouchers/custom-old.pdf");
  var link = "/vouchers/cheque-prenda-" + replaceText + ".pdf";
  var targetFile = path.join(__dirname, "../private" + link);
  var pageNumber = 0;
  
  var writer = hummus.createWriterToModify(sourceFile, {
    modifiedFilePath: targetFile,
    log: path.join(__dirname, "../hummus.md")
  });
  var sourceParser = writer.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
  var pageObject = sourceParser.parsePage(pageNumber);
  var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();
  var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');
  //read the original block of text data
  var data = [];
  var readStream = sourceParser.startReadingFromStream(textStream);
  while(readStream.notEnded()){
    Array.prototype.push.apply(data, readStream.read(10000));
  }
  console.log(replaceText);
  var string = Buffer.from(data).toString().replace(findText, replaceText);
  console.log(string);

  // Create and write our new text object.
  var objectsContext = writer.getObjectsContext();
  objectsContext.startModifiedIndirectObject(textObjectId);
  
  var stream = objectsContext.startUnfilteredPDFStream();
  // strToByteArray is the helper from the linked solution (string to array of byte values)
  stream.getWriteStream().write(strToByteArray(string));
  objectsContext.endPDFStream(stream);
  
  objectsContext.endIndirectObject();
  
  writer.end();

  return link;
}

and the source PDF is here.

Most likely the font is embedded only as a subset, so only the originally required glyphs are present. This is just one of many situations in which the oversimplified text replacement method you refer to fails. — mkl, Dec 23 '20 at 22:35
@mkl Yes indeed: the file with all the glyphs hidden does not have this problem. Illustrator could probably render it because it has the source font. Can you write an answer and recommend a more reliable solution? — miguelmorin, Dec 24 '20 at 15:02
I could write an answer but I don't have a more reliable JavaScript solution. — mkl, Dec 24 '20 at 18:38
That's OK, can you add a more reliable method that is not JavaScript? — miguelmorin, Dec 26 '20 at 09:46
In Java with iText I'd apply text extraction with coordinates first (to find the text to replace), remove the text using redaction at those coordinates, and add the replacement as new text. When I'm back in the office next year, I can write something up to that effect in more detail. — mkl, Dec 26 '20 at 11:05

score 1 · Accepted Answer · answered Jan 06 '21 at 15:22

In your example PDF two fonts are used, MyriadPro-Regular and AmaticSC-Bold, and both are embedded only as subsets.

Thus, when you use your code to replace strings in text-showing instructions, regular PDF viewers can display only those glyphs that are in the embedded subset of the font selected before that instruction. Adobe Illustrator, on the other hand, uses the full fonts for editing purposes if they are available locally.
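A subset-embedded font can usually be recognised by its name: according to the PDF specification, the font name of a subset starts with a tag of six uppercase letters followed by a plus sign. A minimal sketch of such a check follows; the function name is made up for illustration, and the tag "ABCDEF" is a placeholder, not the tag from your file.

```javascript
// Hypothetical helper: returns true when a BaseFont name carries a
// subset tag, i.e. six uppercase letters followed by "+".
function isSubsetFontName(baseFontName) {
  // Tolerate a leading slash as it appears in PDF name objects.
  var name = baseFontName.replace(/^\//, "");
  return /^[A-Z]{6}\+/.test(name);
}

// "ABCDEF" is a placeholder tag used only for this example.
console.log(isSubsetFontName("/ABCDEF+AmaticSC-Bold"));
console.log(isSubsetFontName("AmaticSC-Bold"));
```

If the check is positive for a font, replacement text that needs glyphs outside the subset may not render in regular viewers.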

If you created the template PDF yourself and can still re-create it, do so, but make sure all the required glyphs are embedded. One way to ensure this is to put an invisible string containing all required characters from the respective font somewhere in the document; this usually makes PDF creators embed all those glyphs. You can draw an invisible string by using the invisible text rendering mode, by drawing white on white, by covering it with something else, or by drawing outside the clip path or outside the page boundaries.

If you cannot re-create the template, you can do as gal kahana has proposed in the Hummus issue "How to search and replace text within a document?" referred to by the answer you took your code from:

the tricky part is to add any new characters to the font definition. Assuming that the PDF has only the characters it needs for rendering the text it already has, this means probably that you need to know which original font was used...realizing it from the PDF is not very easy, but can be done. doing the actual embedding...you are probably better off creating a new font using hummus, with the same name, and writing all the text using that font. simply replace the Tf command placing the old font with the new one, and use Tjs to place the new text

In case of your example PDF you have

/TT0 1 Tf
65.5689 -24.5097 24.5097 65.5689 363.9941 762.3682 Tm
(e) Tj
61.1539 -34.0617 34.0617 61.1539 381.1689 756.6411 Tm
(n) Tj
54.8214 -43.5272 43.5272 54.8214 408.6333 741.0947 Tm
(d) Tj
48.8331 -50.153 50.153 48.8331 426.3779 726.999 Tm
(a) Tj
52 0 0 52 75.8203 226.9756 Tm
(nomegenerico) Tj
33 0 0 33 25.8203 302.9756 Tm
(E) Tj
(ste cheque-prenda, para:) Tj
1.818 -7.152 Td
(www.emocoes.org/nomegenerico) Tj

So if you use Hummus to add a complete enough copy of the AmaticSC-Bold font to the page resources with a new name, e.g. ASCB, you then would replace

(nomegenerico) Tj

by

/ASCB 1 Tf
(REPLACEMENT_TEXT) Tj
/TT0 1 Tf

and also

(www.emocoes.org/nomegenerico) Tj

by

/ASCB 1 Tf
(www.emocoes.org/REPLACEMENT_TEXT) Tj
/TT0 1 Tf

to follow gal kahana's advice.
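The string-level part of this replacement could be sketched as follows. This is only an illustration under several assumptions: the function names are made up, the font resource ASCB has already been added to the page resources (that step is not shown), the strings are simple literal strings shown with Tj, and the encoding is single-byte as in your file.

```javascript
// Escape the characters that are special inside a PDF literal string.
function escapePdfString(text) {
  return text.replace(/[\\()]/g, function (ch) { return "\\" + ch; });
}

// Hypothetical helper: for every "(...) Tj" instruction whose string
// contains findText, switch to newFont, show the replaced string, and
// switch back to oldFont. The font size operand 1 matches your stream,
// where the actual size comes from the text matrix (Tm).
function replaceWithFontSwitch(content, findText, replaceText, newFont, oldFont) {
  // Literal string (with backslash escapes, no nested parentheses) + Tj.
  var pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  return content.replace(pattern, function (match, literal) {
    if (literal.indexOf(findText) === -1) {
      return match; // leave unrelated instructions untouched
    }
    var replaced = literal.split(findText).join(escapePdfString(replaceText));
    return "/" + newFont + " 1 Tf\n(" + replaced + ") Tj\n/" + oldFont + " 1 Tf";
  });
}

var sample = "(E) Tj\n(nomegenerico) Tj\n(www.emocoes.org/nomegenerico) Tj";
console.log(replaceWithFontSwitch(sample, "nomegenerico", "REPLACEMENT_TEXT", "ASCB", "TT0"));
```

Using a callback with split and join replaces every occurrence and avoids the special meaning of "$" in the replacement argument of String.prototype.replace.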

Beware: While the approach discussed above will most likely work in the case of your template, the general case is much more complicated; see this answer for some background.

The least you'd have to do for a more generic solution is to take the font encoding into account. In your PDF both fonts are used with WinAnsiEncoding, which is pretty much like Latin-1, but in general each font can have its own encoding, and that encoding need not be a standard encoding but may instead be a completely custom one. This requires that you keep track of which font is currently set in the content stream and look up the corresponding information in the font resource to interpret the following text strings correctly.
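The bookkeeping part, knowing which font is current for each text-showing instruction, might look like the sketch below. It is deliberately simplistic and rests on stated assumptions: it uses a regular expression instead of a real content stream parser, handles only Tf and Tj with literal strings, and ignores TJ arrays, hex strings, nested parentheses, and the q/Q operators that save and restore the graphics state. The function name is made up, and looking up the encoding in the font resource is not shown.

```javascript
// Hypothetical helper: list every string shown with Tj together with
// the name of the font resource that was selected at that point.
function listShownText(content) {
  // Alternative 1: "/Name size Tf"; alternative 2: "(literal) Tj".
  var token = /\/([^\s\/\[\]()<>{}%]+)\s+[-+]?[\d.]+\s+Tf|\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  var currentFont = null;
  var result = [];
  var m;
  while ((m = token.exec(content)) !== null) {
    if (m[1] !== undefined) {
      currentFont = m[1]; // a Tf operator selects a new current font
    } else {
      result.push({ font: currentFont, text: m[2] });
    }
  }
  return result;
}

var sample = "/TT0 1 Tf\n52 0 0 52 75.8203 226.9756 Tm\n(nomegenerico) Tj";
console.log(listShownText(sample));
```

With the font name for each string at hand, the encoding of that font resource determines how the bytes of the string map to characters.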

Gal kahana explained how to do this with Hummus in the article "Extracting Text from PDF files". For a generic text replacement method you "merely" have to extend the code provided there to allow replacing instructions drawing specific text pieces.
