
I've been searching the Internet for two weeks and found some interesting approaches to my problem, but none of them gives me the answer.

My goal is to do the following:

I want to find a text in a static PDF file and replace it with another text, while keeping the design of the content. Is it really that hard?

I found a way, but it loses all the layout and formatting information:

    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Extracts plain text only; fonts, positions and layout are lost
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        // Replace once, after the text of all pages has been collected
        text.Replace(txt_SuchenNach.Text, txt_ErsetzenMit.Text);

        return text.ToString();
    }

My second attempt was much better, but it only works for form fields whose text I can change:

    string fileNameExisting = path;
    string fileNameNew = @"C:\TEST.pdf";

    using (FileStream existingFileStream = new FileStream(fileNameExisting, FileMode.Open))
    using (FileStream newFileStream = new FileStream(fileNameNew, FileMode.Create))
    {
        // Open the PDF
        PdfReader pdfReader = new PdfReader(existingFileStream);

        PdfStamper stamper = new PdfStamper(pdfReader, newFileStream);

        var form = stamper.AcroFields;
        var fieldKeys = form.Fields.Keys;
        foreach (string fieldKey in fieldKeys)
        {
            var value = pdfReader.AcroFields.GetField(fieldKey);
            form.SetField(fieldKey, value.Replace(txt_SuchenNach.Text, txt_ErsetzenMit.Text));
        }

        // Make the text fields non-editable (they then look like normal text)
        stamper.FormFlattening = true;

        stamper.Close();
        pdfReader.Close();
    }

This keeps the formatting of the rest of the text and changes only the text I searched for. I need a solution for text that is NOT in a text field.

Thanks for all your answers and your help.

EugenSunic
Kevin Plaul
  • "Is it really that hard?" Yes, generally speaking it is. Are you aware of *font subsetting*? What if you insert a character that is not in the existing subset? You would need to find out what font was used originally (not always trivial) and then *have* that font on your system. (There are other problems than this -- I see this is a duplicate question.) – Jongware Apr 13 '15 at 08:43
  • Hi Jongware, I know there is already a post like mine, but it has no "maybe" code, and the answer "NO" is not really a good answer. =) But thank you for your comment. I hate PDF – Kevin Plaul Apr 13 '15 at 09:07
  • "No it can't be done" *is* a good answer. No matter how long you search the internet, you cannot find a method to walk from Britain to America. – Jongware Apr 13 '15 at 09:10

2 Answers

The general issue is that text objects may use embedded fonts that contain glyphs only for the letters actually used. For example, if a text object contains the text "abcdef", the embedded font may contain glyphs for the letters "abcdef" only and for no others. If you then replace "abcdef" with "xyz", the PDF cannot display "xyz" because the font has no glyphs for those letters.
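
To make the subsetting problem concrete, here is a minimal sketch in plain C#, not iTextSharp code. It assumes the simplest possible model, in which a subsetted font is just the set of characters that have an embedded glyph. `SubsetCheck`, `GlyphsFor` and `MissingGlyphs` are hypothetical names invented for this illustration. Real PDFs map character codes to glyphs through font encodings, so an actual check would have to inspect the font dictionary instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of iTextSharp: models a subsetted font as the
// set of characters for which a glyph is embedded in the PDF.
public static class SubsetCheck
{
    // Assumption: the subset contains exactly the characters of the text
    // the font was originally embedded for.
    public static HashSet<char> GlyphsFor(string originalText)
    {
        return new HashSet<char>(originalText);
    }

    // Returns the distinct characters of the replacement text that the
    // subset has no glyph for, in order of first appearance.
    public static List<char> MissingGlyphs(HashSet<char> subset, string replacement)
    {
        return replacement.Where(c => !subset.Contains(c)).Distinct().ToList();
    }
}
```

With a subset built from "abcdef", `SubsetCheck.MissingGlyphs(subset, "xyz")` reports x, y and z, whereas "fade" can be drawn entirely from the existing glyphs and yields an empty list.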

So I would consider the following workflow:

  • Iterate through all the text objects;
  • Add new text objects, created from scratch, on top of the PDF page and give them the same properties (font, position, etc.) but a different text. This step may require you to have the same fonts installed on your machine as were used in the original PDF; alternatively, you can check for installed fonts and use another font for the new text object. Either way, iTextSharp or another PDF tool will embed a new font object for the new text object;
  • Remove original text object once you have created a duplicated text object;
  • Process every text object with the workflow described above;
  • Save the modified PDF document into a new file.
Community
Eugene
  • Amen to that. I also like the comments by @Jongware because they clearly explain why the OP is trying to use PDF for something it shouldn't be used for. "I want to replace one String by another in PDF and keep all styles and have the text reflow" is a question that sounds like "I want to watch TV on my radio" and remarks such as "I hate eating soup with a fork". – Bruno Lowagie Apr 13 '15 at 10:19
  • Very good, thorough explanation indeed! I think we'll redirect duplicates to this answer from now on! If the OP still doesn't like it I'd encourage them to click to each user's profile that's replied so far and look at their tags. They'll find a combined score of over 1,000 in the [pdf] categories so I think they received a very knowledgeable response. – Chris Haas Apr 13 '15 at 13:20

I worked on the same requirement and was able to achieve it with the following steps.

Step 1: Locate the source PDF file and the destination file path.

Step 2: Read the source PDF file and search for the location of the string to replace.

Step 3: Replace the string with the new one.

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using PDFExtraction;    
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;

namespace PDFReplaceTextUsingItextSharp
{
    public partial class ExtractPdf : System.Web.UI.Page
    {
        static iTextSharp.text.pdf.PdfStamper stamper = null;
        protected void Page_Load(object sender, EventArgs e)
        {

        }

        protected void Replace_Click(object sender, EventArgs e)
        {
            string ReplacingVariable = txtReplace.Text; 
            string sourceFile = "Source File Path";
            string descFile = "Destination File Path";
            PdfReader pReader = new PdfReader(sourceFile);
            stamper = new iTextSharp.text.pdf.PdfStamper(pReader, new System.IO.FileStream(descFile, System.IO.FileMode.Create));
            PDFTextGetter("ExistingVariableinPDF", ReplacingVariable , StringComparison.CurrentCultureIgnoreCase, sourceFile, descFile);
            stamper.Close();
            pReader.Close();
        }


        /// <summary>
        /// Searches the PDF for the locations of the given text and overwrites each match with the text in replacingText
        /// </summary>
        /// <param name="pSearch">Text to search for</param>
        /// <param name="replacingText">Replacement text</param>
        /// <param name="SC">String comparison mode (e.g. case-insensitive)</param>
        /// <param name="SourceFile">Path of the source file</param>
        /// <param name="DestinationFile">Path of the destination file</param>
        public static void PDFTextGetter(string pSearch, string replacingText, StringComparison SC, string SourceFile, string DestinationFile)
        {
            try
            {
                iTextSharp.text.pdf.PdfContentByte cb = null;
                iTextSharp.text.pdf.PdfContentByte cb2 = null;
                iTextSharp.text.pdf.PdfWriter writer = null;
                iTextSharp.text.pdf.BaseFont bf = null;

                if (System.IO.File.Exists(SourceFile))
                {
                    PdfReader pReader = new PdfReader(SourceFile);


                    for (int page = 1; page <= pReader.NumberOfPages; page++)
                    {
                        myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();
                        cb = stamper.GetOverContent(page);
                        cb2 = stamper.GetOverContent(page);

                        //Pass some state from the PdfContentByte to the strategy; for me the first value is always zero and the second 100,
                        //but I'm not sure whether this can change in some cases
                        strategy.UndercontentCharacterSpacing = (int)cb.CharacterSpacing;
                        strategy.UndercontentHorizontalScaling = (int)cb.HorizontalScaling;

                        //It's not really needed to get the text back, but we have to call this line ALWAYS, 
                        //because it triggers the process that will get all chunks from PDF into our strategy Object
                        string currentText = PdfTextExtractor.GetTextFromPage(pReader, page, strategy);

                        //The real getter process starts in the following line
                        List<iTextSharp.text.Rectangle> MatchesFound = strategy.GetTextLocations(pSearch, SC);

                        //Set the fill color of the shapes. I don't use a border because it would make the rect bigger,
                        //but a thin border could be a solution if the current rect is not big enough to cover all the text it should cover
                        cb.SetColorFill(BaseColor.WHITE);

                        //MatchesFound contains the rectangles of all matches; each one is covered with a white rectangle and the replacement text is drawn on top:

                        foreach (iTextSharp.text.Rectangle rect in MatchesFound)
                        {
                            //The width is hard-coded to 60 points; position and height come from the matched text
                            cb.Rectangle(rect.Left, rect.Bottom, 60, rect.Height);
                            cb.Fill();
                            cb2.SetColorFill(BaseColor.BLACK);
                            bf = BaseFont.CreateFont(BaseFont.HELVETICA_BOLD, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);

                            cb2.SetFontAndSize(bf, 9);

                            cb2.BeginText();
                            cb2.ShowTextAligned(0, replacingText, rect.Left, rect.Bottom, 0);   
                            cb2.EndText();
                            cb2.Fill();
                        }

                    }

                    //Close the second reader, which was opened only for text extraction
                    pReader.Close();
                }

            }
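            catch (Exception ex)
            {
                //Don't swallow the error silently; at least record it for diagnosis
                System.Diagnostics.Debug.WriteLine(ex);
            }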

        }

    }
}
  • Where do you "replace"? In particular, where do you remove the original text and where do you add new text *using the same style as the original*? – mkl Dec 14 '16 at 12:20
  • `cb = stamper.GetOverContent(page); cb2 = stamper.GetOverContent(page);` Here cb paints the white background over the existing text and cb2 draws the new text. First we search for the position of the existing string and store it in the "MatchesFound" variable, then we fill white over the existing string with `cb.SetColorFill(BaseColor.WHITE)`. After that we loop over the MatchesFound object and write the new string at the same position as the white-painted string. Hope that makes sense. – Pradeep Kumar Dec 16 '16 at 05:06
  • *fill white color on the existing string* - that is not **removing** as the text can still be copied&pasted. As long as the pdf shall only be printed, that is OK, but if it shall still be electronically distributed, that can be a show-stopper. – mkl Dec 16 '16 at 05:45
  • Yes, true, it is not feasible for the distribution case; it is feasible only for downloading the PDF form after modifications. – Pradeep Kumar Dec 16 '16 at 07:11
  • Do you have an update for itextpdf v7? `PdfStamper` does not appear to be existing in v7 :( – CularBytes Jun 27 '19 at 17:21
  • Can you please add your custom inherited class "myLocationTextExtractionStrategy"? What does it do? – Tech Yogesh Jan 18 '21 at 11:10