
As part of our automated build process, we'd like to patch the build number in a series of PDF files (our reference guides). The clean way would be to automate a LibreOffice macro that updates a field and exports the PDF again.

However, I'd like to know whether there is a more direct (if arguably dirty) solution: running a binary find and replace on a placeholder in the PDF file. The content doesn't seem to appear as clear text in the PDF, though. Is there any trick that would help?

Serge Wautier
  • How should the build number be retrievable? Should it be visible in a normal PDF viewer? Or should it be contained in a hidden place? – mkl Aug 30 '16 at 08:46
  • It should be visible in the text, such as in a footer or appendix – Serge Wautier Aug 30 '16 at 09:09
  • In that case most likely @Bruno's answer shows what a quick and dirty solution would look like. – mkl Aug 30 '16 at 15:53

1 Answer


The number isn't available in clear text because it is part of a content stream that is compressed.

Take a "Hello World!" example. The content stream that represents that text could look like this:

2 0 obj
<</Length 65/Filter/FlateDecode>>stream
xœ+är
á26S°00SIá2PÐ5´ 1ôÝBÒ¸4<RsròÂó‹rR5C²€j@*\C¸¹ Çq°
endstream
endobj

When you decompress the binary part, you'll find this:

q
BT
36 806 Td
0 -18 Td
/F1 12 Tf
(Hello World!) Tj
0 0 Td
ET
Q
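To make the decompression step concrete, here is a minimal sketch using only the JDK's java.util.zip classes. It assumes the stream uses a single /FlateDecode filter with no predictor; the class name FlateDemo and its methods are made up for this illustration and are not part of iText.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: round-trips bytes through the zlib format that /FlateDecode uses.
public class FlateDemo {

    // Compress raw content-stream bytes.
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress the bytes found between "stream" and "endstream".
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Calling inflate() on the 65 bytes of the object above should give you the operators shown here, provided the bytes were extracted from the file without any character-set conversion.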

However, the following syntax would also be correct:

BT
/F1 12 Tf
88.66 806 Td
(ld!) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET

This syntax is much harder to read, but if you do all the math and reorganize the different text snippets based on the changes to the text matrix, you'll discover that the output is identical to the output of the syntax we had before.
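If you want to check that math, the sketch below accumulates the Td offsets and sorts the snippets by their x position. It is a toy under strong assumptions: it only understands "x y Td" and "(text) Tj" lines, ignores y (all snippets here sit on the same line), and does not handle escapes inside strings. The class TdOrderDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: recovers left-to-right order from Td/Tj operators.
public class TdOrderDemo {

    private static final Pattern TD = Pattern.compile("^(-?[0-9.]+) (-?[0-9.]+) Td$");
    private static final Pattern TJ = Pattern.compile("^\\((.*)\\) Tj$");

    public static final class Snippet {
        public final double x;
        public final String text;
        Snippet(double x, String text) { this.x = x; this.text = text; }
    }

    // Returns the snippets sorted by the x coordinate at which they are drawn.
    public static List<Snippet> positions(String contentStream) {
        List<Snippet> snippets = new ArrayList<>();
        double x = 0; // each Td is relative to the start of the current line
        for (String rawLine : contentStream.split("\n")) {
            String line = rawLine.trim();
            Matcher td = TD.matcher(line);
            Matcher tj = TJ.matcher(line);
            if (td.matches()) {
                x += Double.parseDouble(td.group(1));
            } else if (tj.matches()) {
                snippets.add(new Snippet(x, tj.group(1)));
            }
        }
        snippets.sort(Comparator.comparingDouble((Snippet s) -> s.x));
        return snippets;
    }

    // Concatenates the snippets in visual order.
    public static String readingOrder(String contentStream) {
        StringBuilder sb = new StringBuilder();
        for (Snippet s : positions(contentStream)) {
            sb.append(s.text);
        }
        return sb.toString();
    }
}
```

For the stream above, the x positions work out to 88.66, 66.66, 51.33 and 36, so the visual order is "He", "llo", "Wor", "ld!". Note that there is no space character anywhere in the stream: the gap between "Hello" and "World!" comes purely from positioning, which is one more reason a naive search for the full string can fail.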

If your PDFs are created in a straightforward way, meaning that the strings can be easily recognized in the decompressed syntax, you could get the content stream of a page, decompress it, change it, compress it, and put it back in the PDF.

This also assumes that the string you are looking for is present in the content stream of the page itself, and not in an external content stream such as a Form XObject.

If all these assumptions are met, you could use iText like this:

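// Assumes iText 5 package names (com.itextpdf.text.pdf); src and dest are file paths.
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.FileOutputStream;

PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
int total = reader.getNumberOfPages();
// Page numbers are 1-based.
for (int i = 1; i <= total; i++) {
    // Content stream bytes of page i
    byte[] content = reader.getPageContent(i);
    byte[] alteredBytes = doSomethingWith(content);
    reader.setPageContent(i, alteredBytes);
}
stamper.close();
reader.close();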

You have to implement the doSomethingWith() method so that it performs the binary search & replace you need.
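One possible shape for doSomethingWith() is sketched below. Everything specific in it is an assumption for illustration: the placeholder "@BUILD@", the build number "1.2.3", and the class name PlaceholderPatch are invented, and it only works if the placeholder appears as one contiguous, single-byte-encoded literal string in the content stream.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: naive search and replace on a decompressed content stream.
public class PlaceholderPatch {

    // Invented values; substitute whatever your documents and build actually use.
    static final String PLACEHOLDER = "@BUILD@";
    static final String BUILD_NUMBER = "1.2.3";

    // Backslashes and parentheses must be escaped inside a PDF literal string.
    public static String escapePdfLiteral(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    // ISO-8859-1 maps every byte to exactly one char and back, so bytes
    // that are not part of the placeholder pass through unchanged.
    public static byte[] replacePlaceholder(byte[] content, String placeholder, String value) {
        String text = new String(content, StandardCharsets.ISO_8859_1);
        String patched = text.replace(placeholder, escapePdfLiteral(value));
        return patched.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Same signature as the call in the loop above.
    public static byte[] doSomethingWith(byte[] content) {
        return replacePlaceholder(content, PLACEHOLDER, BUILD_NUMBER);
    }
}
```

Keep the value to plain ASCII. Even then, the replacement can still render wrongly if the font is embedded as a subset that lacks the glyphs for the new characters, or if the new text is wider than the space the layout left for the placeholder.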

Important: you asked for a quick & dirty way, and this is a very quick & dirty way. If I saw one of my employees submitting this code, I'd fire him or her on the spot unless he or she could give me a decent argument for using it. This code will fail for many PDFs, but it might be just what you need in your very specific use case.

You might also want to read: iText or iTextSharp rudimentary text edit

Bruno Lowagie
  • Thanks for such a comprehensive answer. Very insightful! We'll give up on this for the moment and go for the clean solution (automate LibreOffice) later. – Serge Wautier Aug 31 '16 at 07:52