I have a (seemingly) easy task: replace strings in a PDF file.

I found iText 7, which looks like a nice solution, but even after reading through the available documentation I cannot adapt the code to my specific file structure. My sub so far is mostly adapted from this example: https://kb.itextpdf.com/home/it7kb/examples/replacing-pdf-objects

It works for the example PDF file from the iText team, but not for my specific PDF. I have already found out why, but I don't understand how to adjust the code.

Imports iText.Commons.Utils
Imports iText.Kernel.Pdf    

Sub PDFreplaceStrings(Filename As String, Placeholders() As String, Replacements() As String)
    Dim strTempFileName As String = My.Application.Info.DirectoryPath & "\Temp.pdf"
    Dim pdfDoc As PdfDocument
    Dim pdfCurPage As PdfPage
    Dim pdfDict As PdfDictionary
    Dim pdfObj As PdfObject
    Dim pdfStr As PdfStream = Nothing
    Dim pdfData() As Byte
    Dim strReplacement As String
    
    'Check parameters:
    If My.Computer.FileSystem.FileExists(FileName) = False or Placeholders.Length = 0 Or Replacements.Length = 0 or Placeholders.Length <> Replacements.Length Then exit sub

    'The main part:
    My.Computer.FileSystem.MoveFile(FileName, strTempFileName, True)
    pdfDoc = New PdfDocument(New PdfReader(strTempFileName), New PdfWriter(FileName))
    pdfCurPage = pdfDoc.GetFirstPage
    pdfDict = pdfCurPage.GetPdfObject
    pdfObj = pdfDict.Get(PdfName.Contents)
    If TypeOf pdfObj Is PdfStream Then 'the ideal case
        pdfStr = CType(pdfObj, PdfStream)
    ElseIf TypeOf pdfObj Is PdfArray Then 'my case
        'How do I get the stream now?
    End If
    
    If pdfStr IsNot Nothing Then
        pdfData = pdfStr.GetBytes 'this is the part from the original code which doesn't work for my file, since my pdfObj is of type PdfArray instead of pdfStream
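        strReplacement = JavaUtil.GetStringForBytes(pdfData) 'decode the stream content once
        For i As Integer = 0 To Placeholders.Length - 1 'Since I want to replace multiple strings, I iterate over an array of placeholders and replacement strings
            strReplacement = strReplacement.Replace(Placeholders(i), Replacements(i)) 'keep the result so earlier replacements are not lost
        Next
        pdfStr.SetData(System.Text.Encoding.UTF8.GetBytes(strReplacement)) 'write the stream back once, after all replacements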
    End If
    
    'Wrap-up:
    pdfDoc.Close()
    My.Computer.FileSystem.DeleteFile(strTempFileName)
end sub
Sub Test
    Dim ar1(), ar2() As String
    ar1 = {"#PPF#"}
    ar2 = {"30"}
    PDFreplaceStrings("BLP_Report-Test.pdf",ar1,ar2)
End Sub

My key problem is: I know that in the case of my example file, the content of my page is of type PdfArray instead of PdfStream, and that is why the original example code doesn't work on my file. But I don't understand how to adapt it from here. I have zero knowledge of Java, so it's really difficult for me to work out a solution from the iText 7 documentation. Here is my specific example file with the placeholders: https://shimbox.shimadzu.eu/download/c9007175-15ce-42aa-bc06-ccd15b1499f0

Can you please give me a hint how to proceed?

FOBS

1 Answer

This answer does not explain how to implement the task at hand but why the task is a bad idea to begin with.

I have a (seemingly) easy task: replace strings in a PDF file.

It indeed merely seems to be an easy task.

PDF is first and foremost a format for drawing a fixed appearance in the same way on different media. For this task it is not necessary to hold textual content in a form

  • in which the matching characters in a known encoding (e.g. Unicode) can be determined, or
  • in which characters forming words, paragraphs, or similar structures are kept together, let alone
  • in which the textual content can easily be edited.

Thus, plain PDF does not require textual content to be stored in such a form. Consequently, it often is not.

The example from the iText site you found may, taken by itself, leave a different impression. But that example is actually a port of the code in this answer by Bruno Lowagie, where he says that the code works if your PDFs are relatively simple and then explains:

In real life, PDFs are never that simple and the complexity of your project will increase dramatically with every special feature that is used in your documents.

For some more details on those complexities see this answer.
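
One such complexity is the one the question ran into: a page's /Contents entry may be an array of streams rather than a single stream. As a minimal sketch (assuming iText 7 for .NET; the module and method names are made up for illustration), the individual streams can be listed through PdfPage.GetContentStreamCount and PdfPage.GetContentStream, regardless of how /Contents is stored:

```vbnet
Imports System
Imports System.Text
Imports iText.Kernel.Pdf

Module ContentStreamSketch
    ' Hypothetical helper: dumps every content stream of the first page,
    ' whether /Contents is a single stream or an array of streams.
    Sub DumpFirstPageStreams(path As String)
        Dim pdfDoc As New PdfDocument(New PdfReader(path))
        Dim page As PdfPage = pdfDoc.GetFirstPage()
        For i As Integer = 0 To page.GetContentStreamCount() - 1
            Dim stream As PdfStream = page.GetContentStream(i)
            Dim bytes() As Byte = stream.GetBytes() ' decoded stream content
            Console.WriteLine("Stream {0}: {1} bytes", i, bytes.Length)
            ' ASCII is enough for a quick look; non-ASCII bytes show up as "?".
            Console.WriteLine(Encoding.ASCII.GetString(bytes))
        Next
        pdfDoc.Close()
    End Sub
End Module
```

Being able to read the streams does not make the replacement work, though. An instruction and its operands may be divided between consecutive streams at token boundaries, so searching each stream separately is not reliable.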


With respect to your example document, there are more issues than the one you found (that "the content of my page is of type PdfArray instead of PdfStream"), in particular:

  • The text in the content stream is not represented using some ASCII-like encoding but using the glyph IDs from the font, and these IDs are then hex-encoded. Furthermore, words are not represented as a simple sequence of such hex-encoded glyph IDs but have kerning information in between. Thus, the text line "Photosensitive Protection Factor = #PPF#" is drawn using this command:

    [<0033>56.000000<004B>16.000000<0052>26.000000<0057>37.000000<0052>16.000000<0056>50.000000<0048>16.000000<0051>16.000000<0056>40.000000<004C>62.000000<0057>47.000000<004C>72.000000<0059>40.000000<0048>26.000000<0003>37.000000<0033>56.000000<0055>23.000000<0052>16.000000<0057>47.000000<0048>26.000000<0046>30.000000<0057>47.000000<004C>72.000000<0052>16.000000<0051>16.000000<0003>57.000000<0029>70.000000<0044>16.000000<0046>50.000000<0057>37.000000<0052>16.000000<0055>23.000000<0003>47.000000<0020>53.000000<0003>47.000000<0006>16.000000<0033>46.000000<0033>56.000000<0029>70.000000<0006>-0.152344] TJ 
    

    The four-digit hexadecimal numbers in the angle brackets <....> are the glyph IDs in question, so your "#PPF#" placeholder is <0006>16.000000<0033>46.000000<0033>56.000000<0029>70.000000<0006>.

  • The font in question is subset-embedded, i.e. it is embedded in the PDF and only contains the glyphs actually used in the PDF. For example, looking at the capital letters there is no 'C', 'G', 'I', 'J', 'K', 'M', 'N', 'O', 'Q', 'R', 'U', 'W', 'X', 'Y', or 'Z' available for replacement text.
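
To make the glyph-ID point concrete, here is a minimal sketch in plain VB.NET (no iText involved) that pulls the hex codes out of a TJ operand and looks for the placeholder's glyph sequence. The module and function names are made up for illustration, and the code assumes that every hex string holds exactly one four-digit code, as in the line quoted above; other PDFs may pack several glyphs into one string or use different code lengths.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TjGlyphSketch
    ' Returns the hex glyph IDs in a TJ operand, ignoring the kerning numbers between them.
    Function ExtractGlyphIds(tjOperand As String) As List(Of String)
        Dim ids As New List(Of String)
        For Each m As Match In Regex.Matches(tjOperand, "<([0-9A-Fa-f]{4})>")
            ids.Add(m.Groups(1).Value.ToUpperInvariant())
        Next
        Return ids
    End Function

    ' Returns the index of the first occurrence of needle in haystack, or -1 if absent.
    Function IndexOfSequence(haystack As List(Of String), needle As List(Of String)) As Integer
        For i As Integer = 0 To haystack.Count - needle.Count
            Dim found As Boolean = True
            For j As Integer = 0 To needle.Count - 1
                If haystack(i + j) <> needle(j) Then
                    found = False
                    Exit For
                End If
            Next
            If found Then Return i
        Next
        Return -1
    End Function

    Sub Main()
        ' Tail of the TJ operand quoted above, i.e. "= #PPF#".
        Dim operand As String = "[<0020>53.000000<0003>47.000000<0006>16.000000<0033>46.000000<0033>56.000000<0029>70.000000<0006>-0.152344]"
        Dim glyphs As List(Of String) = ExtractGlyphIds(operand)
        ' Glyph IDs of "#PPF#" in this particular subset font.
        Dim placeholder As New List(Of String) From {"0006", "0033", "0033", "0029", "0006"}
        Console.WriteLine(String.Join(" ", glyphs))               ' prints 0020 0003 0006 0033 0033 0029 0006
        Console.WriteLine(IndexOfSequence(glyphs, placeholder))   ' prints 2
    End Sub
End Module
```

This only locates the placeholder. It does not solve the replacement problem, because the glyphs needed for the new text still have to exist in the font.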

Thus, instead of using text replacement consider using PDFs with form fields. You can fairly easily fill in form fields. And if you don't want others to (easily) be able to change those values, flatten the form fields after fill-in.
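
As a rough illustration of the form field route, here is a minimal sketch written against iText 7 for .NET (PdfAcroForm from the iText.Forms namespace). It assumes a template PDF that already contains AcroForm text fields; the module name, the method name, and the field name "PPF" in the usage line are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports iText.Forms
Imports iText.Kernel.Pdf

Module FormFillSketch
    ' Hypothetical helper: fills the named form fields of a template PDF and flattens them.
    Sub FillForm(templatePath As String, outputPath As String, values As IDictionary(Of String, String))
        Dim pdfDoc As New PdfDocument(New PdfReader(templatePath), New PdfWriter(outputPath))
        Dim form As PdfAcroForm = PdfAcroForm.GetAcroForm(pdfDoc, True)
        For Each pair In values
            Dim field = form.GetField(pair.Key)
            ' Skip names that do not exist in the template instead of failing.
            If field IsNot Nothing Then field.SetValue(pair.Value)
        Next
        form.FlattenFields() ' turn the filled-in fields into static page content
        pdfDoc.Close()
    End Sub
End Module
```

A call could then look like `FillForm("Template.pdf", "Report.pdf", New Dictionary(Of String, String) From {{"PPF", "30"}})`, with both file names being placeholders.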

mkl
  • Thank you for your comment. It's clear to me now that I had a really oversimplified idea of what a PDF actually is, and I now have an idea why the Word approach also fails spectacularly. The problem is that the PDFs are automatically generated by another, proprietary software and I have no influence on how it does so. I can only add "text boxes" in the "report editor" of this other software, hence my idea of using "simple" placeholders, which I thought were simple text. There is no possibility to add form fields in the editor of this other software. – FOBS Apr 05 '22 at 06:39
  • Well, using text extraction with extras, it is feasible to determine where those placeholders are. The placeholder characters in your example PDF are drawn in the correct order in a single instruction with no other characters in between. If that's the case in general, removing them is also feasible. But due to the subset-embedded font, one has to add the replacement using a new font object. In particular, your code will need to have the font in question available. – mkl Apr 05 '22 at 10:09
  • OK, that sounds theoretically possible, but technically it clearly exceeds my programming skills. Maybe I have to think about another approach entirely. – FOBS Apr 05 '22 at 11:46