
I'm looking for a way to replace text in a PDF in C#. The use case: we have a client who needs to sign a PDF, and we want to pre-populate a few of the fields (date, name, title, etc.) before they download it. I've found a few potential options such as PDFsharp, but I can't find a way to search the document by text.

Resources I've found so far are:

Find a word in PDF using PDFSharp

https://forum.pdfsharp.net/viewtopic.php?p=4010

However, I wasn't able to get them working for my use case. Any help would be greatly appreciated.

UPDATE: Here is the boilerplate code I've been working with to try to do the search and replace:

string toFind = "client-title";
string toReplace = "John Doe";
PdfSharp.Pdf.PdfDocument PDFDoc = PdfReader.Open("path/to/original/file.pdf", PdfDocumentOpenMode.Import);
PdfSharp.Pdf.PdfDocument PDFNewDoc = new PdfSharp.Pdf.PdfDocument();

for(int i = 0; i < PDFDoc.Pages.Count; i++)
{
    // Find toFind string and replace with toReplace string

    PDFNewDoc.AddPage(PDFDoc.Pages[i]);
}
PDFNewDoc.Save("path/to/new/file.pdf");
  • A PDF contains elements: text, images, etc. You need a way to get them, change them, and build the document again from them, or to get the PDF document object, change its elements, and save it. The main problem you may face is that the string "STRING" can be represented by three text elements: "ST", "RI", "NG". I have written logic to concatenate them, though, and it works well. You could check iTextSharp, I think. – Woldemar89 Jan 30 '19 at 18:34
  • @woldemar Thanks for your quick response. I was able to open the PDF, duplicate it and its contents, and re-save the file using PdfSharp, but I was unable to access the actual words in the file. I looked into iTextSharp and it seems it may be able to do what I'm looking for; however, it's not free. I'm really hoping to find an open-source solution to this issue. – Cyrille Gindreau Jan 30 '19 at 18:45
  • https://github.com/itext/itextsharp/blob/master/LICENSE.md Can you attach an example PDF and point out which word needs to be changed? – Woldemar89 Jan 30 '19 at 18:47
  • Can you post the code(s) you've tried so far? We may be able to expand on that to help. – Nathan Champion Jan 30 '19 at 18:49
  • Indeed, replacing a string can be really non-trivial, depending on how the PDF generator actually generated the page contents. Thus, a representative example PDF is needed to get an idea of how best to implement the replacement. That being said, form fill-ins are usually preferred to text replacement in PDFs. – mkl Jan 30 '19 at 18:53
  • @woldemar You can see in the last paragraph here that a license must be purchased: https://github.com/itext/itextsharp/tree/master – Cyrille Gindreau Jan 30 '19 at 19:21
  • @NathanChampion I have added the base code that I am working with. – Cyrille Gindreau Jan 30 '19 at 19:21
  • @mkl The reason we're doing text replacement is that the original forms can have "invisible" text, and that is what we're searching for and replacing. The idea is to have something generic enough that we could post any number of different forms with text in different locations and pre-populate those fields with whatever the text tag says it needs. – Cyrille Gindreau Jan 30 '19 at 19:21
  • I have only one idea: create or edit the PDF in any editor and add *AcroFields* to it. Then use this PDF and fill in the *AcroFields*: https://www.c-sharpcorner.com/article/fill-in-pdf-form-fields-using-the-open-source-itextsharp-dll/ – Woldemar89 Jan 30 '19 at 20:37
  • Unfortunately, the PDF format is not really designed for editing. Perhaps you could create PDF form fields in the document and then fill those in programmatically? [This answer](https://stackoverflow.com/a/6347519/2557128) suggests you could use a Reader feature to do this. – NetMage Jan 30 '19 at 21:29
  • @CyrilleGindreau an approach with itext would be to first apply text extraction with coordinates. In the extracted text you locate your search term and determine its bounding box. Then you redact away the content of that box and add the replacement objects in that area. – mkl Jan 30 '19 at 22:09
  • I strongly suggest turning that document into a form with fields. Then prefilling would become quite easy. What you want to accomplish at the document level would require fully interpreting the PDF and recreating a new one… so better to use forms. – Max Wyss Jan 31 '19 at 03:18
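
The form-field route suggested in the comments above can be sketched with PDFsharp roughly as follows. This is a minimal sketch, not a tested solution for the asker's files: it assumes the template already contains an AcroForm text field, and the class name `FormFiller`, the method name `FillTextField`, and the use of `client-title` as a field name are hypothetical choices made for illustration.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.AcroForms;
using PdfSharp.Pdf.IO;

// Hypothetical helper: fills one text field of an existing AcroForm.
public static class FormFiller
{
    public static void FillTextField(string inputPath, string outputPath,
                                     string fieldName, string value)
    {
        using (PdfDocument doc = PdfReader.Open(inputPath, PdfDocumentOpenMode.Modify))
        {
            // AcroForm is null when the document has no form at all.
            PdfAcroForm form = doc.AcroForm;
            if (form == null)
                throw new InvalidOperationException("The document contains no AcroForm.");

            // Ask the viewer to regenerate the visual appearance of the fields,
            // so the new value is displayed when the file is opened.
            form.Elements.SetBoolean("/NeedAppearances", true);

            // Look the field up by name; only plain text fields are handled here.
            var field = form.Fields[fieldName] as PdfTextField;
            if (field == null)
                throw new InvalidOperationException("Text field not found: " + fieldName);

            field.Value = new PdfString(value);
            doc.Save(outputPath);
        }
    }
}
```

A call such as `FormFiller.FillTextField("path/to/original/file.pdf", "path/to/new/file.pdf", "client-title", "John Doe");` would then replace the search step entirely, provided the template is authored with named fields rather than invisible text tags.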

1 Answer

My sample below simply replaces the word 'Hello' with 'Hola':

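using System.Diagnostics;
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

class Program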
    {
        static void Main(string[] args)
        {
            string originalPdf = @"C:\origPdf.pdf";

            CreatePdf(originalPdf);

            using (var doc = PdfReader.Open(originalPdf, PdfDocumentOpenMode.Modify))
            {
                var page = doc.Pages[0];
                var contents = ContentReader.ReadContent(page);

                ReplaceText(contents, "Hello", "Hola");
                page.Contents.ReplaceContent(contents);

                doc.Pages.Remove(page);
                doc.AddPage().Contents.ReplaceContent(contents);
               
                doc.Save(originalPdf);
            }

            Process.Start(originalPdf);

        }

        // Code from http://www.pdfsharp.net/wiki/HelloWorld-sample.ashx
        public static void CreatePdf(string filename)
        {
            // Create a new PDF document
            PdfDocument document = new PdfDocument();
            document.Info.Title = "Created with PDFsharp";

            // Create an empty page
            PdfPage page = document.AddPage();

            // Get an XGraphics object for drawing
            XGraphics gfx = XGraphics.FromPdfPage(page);

            // Create a font
            XFont font = new XFont("Verdana", 20, XFontStyle.BoldItalic, new XPdfFontOptions(PdfFontEncoding.WinAnsi));

            // Draw the text
            gfx.DrawString("Hello, World!", font, XBrushes.Black,
              new XRect(0, 0, page.Width, page.Height),
              XStringFormats.Center);

            // Save the document (Main opens it in a viewer afterwards)
            document.Save(filename);
        }

        // Please refer to the PDF specification for everything a content stream can contain:
        // https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
        public static void ReplaceText(CSequence contents, string searchText, string replaceText)
        {
            // Iterate through each content item. An item may or may not contain the entire
            // word if different stylings (e.g. bold parts of the word) are applied to it,
            // so you may have to replace one character at a time.
            for (int i = 0; i < contents.Count; i++)
            {
                if (contents[i] is COperator)
                {
                    var cOp = contents[i] as COperator;
                    for (int j = 0; j < cOp.Operands.Count; j++)
                    {
                        if (cOp.OpCode.Name == OpCodeName.Tj.ToString() ||
                            cOp.OpCode.Name == OpCodeName.TJ.ToString())
                        {
                            if (cOp.Operands[j] is CString)
                            {
                                var cString = cOp.Operands[j] as CString;
                                if (cString.Value.Contains(searchText))
                                {
                                    cString.Value = cString.Value.Replace(searchText, replaceText);
                                }

                            }
                        }
                    }


                }
            }


        }
    }
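
The comment in `ReplaceText` notes that a word may be split across several content items. A minimal sketch of one way to handle that, independent of PDFsharp: join the fragments, locate the match in the joined text, then put the replacement into the first fragment the match touches and cut the matched characters out of the rest. The class `FragmentReplacer` and method `ReplaceAcrossFragments` are hypothetical names, and the sketch assumes the fragments are already decoded strings, so it does nothing about font encodings, subsetting, or kerning.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of PDFsharp.
public static class FragmentReplacer
{
    // Replaces the first occurrence of searchText, which may span several
    // consecutive fragments. Returns true if a replacement was made.
    public static bool ReplaceAcrossFragments(IList<string> fragments,
                                              string searchText, string replaceText)
    {
        if (string.IsNullOrEmpty(searchText)) return false;

        // Join the fragments and remember where each one starts.
        var joined = new StringBuilder();
        var starts = new int[fragments.Count];
        for (int i = 0; i < fragments.Count; i++)
        {
            starts[i] = joined.Length;
            joined.Append(fragments[i]);
        }

        int matchStart = joined.ToString().IndexOf(searchText, StringComparison.Ordinal);
        if (matchStart < 0) return false;
        int matchEnd = matchStart + searchText.Length; // exclusive

        bool inserted = false;
        for (int i = 0; i < fragments.Count; i++)
        {
            int fragStart = starts[i];
            int fragEnd = fragStart + fragments[i].Length;

            // Skip fragments that do not overlap the match.
            if (fragEnd <= matchStart || fragStart >= matchEnd) continue;

            // Part of this fragment that belongs to the match.
            int cutFrom = Math.Max(matchStart, fragStart) - fragStart;
            int cutTo = Math.Min(matchEnd, fragEnd) - fragStart;
            string before = fragments[i].Substring(0, cutFrom);
            string after = fragments[i].Substring(cutTo);

            // The replacement goes into the first overlapping fragment only.
            fragments[i] = before + (inserted ? "" : replaceText) + after;
            inserted = true;
        }
        return true;
    }
}
```

For example, the fragments `"Hello, "`, `"Wor"`, `"ld!"` with search text `World` and replacement `Mundo` become `"Hello, "`, `"Mundo"`, `"!"`. In a real content stream the fragments would be the string operands collected from the text-showing operators, and the rewritten values would have to be written back to those operands in the same order.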
reas
  • This only works in benign circumstances, such as matching font encodings, fonts not subsetted, no kerning applied, ... – mkl Feb 02 '21 at 06:31
  • @mkl Why the downvote? I think my answer satisfies the question. OP did not ask about matching font encodings and such. – reas Feb 02 '21 at 19:05
  • I only commented. The downvote was someone else. Possibly the person who backed my comment. – mkl Feb 02 '21 at 21:34
  • I reserve downvotes for answers that don't work at all or completely ignore the question. Your answer does work in benign circumstances, and one can enforce such circumstances if one controls the template. – mkl Feb 02 '21 at 21:53