
How would I go about replacing / removing text from a PDF file?

I have a PDF file that I obtained somewhere, and I want to be able to replace some text within it.

Or, I have a PDF file in which I want to obscure (redact) some of the text so that it's no longer visible [and so that it looks cool, like the CIA files].

Or, I have a PDF that contains global JavaScript that I want to stop from interrupting my use of the PDF.

BevanWeiss

2 Answers


This is possible in a limited fashion with the use of iText / iTextSharp. It will only work with Tj/TJ opcodes (i.e. standard text, not text embedded in images, or drawn with shapes).

You need to override the default PdfContentStreamProcessor to act on the page content streams, as presented by mkl in "Removing Watermark from PDF iTextSharp". Inherit from this class, and in your new class look for the Tj/TJ opcodes. The operand(s) will generally be the text element(s); for a TJ the operand is an array rather than a single string, so it may require further parsing of all the operands.
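To illustrate why a TJ operand needs more parsing than a Tj operand, here is a minimal sketch that does not use iTextSharp at all. It models a TJ array as a plain `object[]` of string fragments and kerning numbers; that model, the `TjSketch` class, and its method names are assumptions made for illustration and are not part of any library. It shows that a phrase split across array elements is only found once the fragments are joined.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not an iTextSharp API: a TJ array is modelled as
// object[] holding string fragments and numeric kerning adjustments.
public static class TjSketch
{
    // Joins only the string fragments, dropping the kerning numbers.
    public static string JoinFragments(object[] tjArray)
    {
        var sb = new StringBuilder();
        foreach (var item in tjArray)
        {
            if (item is string s)
                sb.Append(s);
        }
        return sb.ToString();
    }

    // True if at least one fragment matches the pattern on its own.
    public static bool MatchesPerFragment(object[] tjArray, string pattern)
    {
        return tjArray.OfType<string>().Any(s => Regex.IsMatch(s, pattern));
    }

    // True if the joined fragments match the pattern.
    public static bool MatchesJoined(object[] tjArray, string pattern)
    {
        return Regex.IsMatch(JoinFragments(tjArray), pattern);
    }

    public static void Main()
    {
        // Roughly what "[(Conf) -20 (iden) 15 (tial)] TJ" carries as operands.
        var tj = new object[] { "Conf", -20, "iden", 15, "tial" };
        Console.WriteLine(MatchesPerFragment(tj, "Confidential")); // False
        Console.WriteLine(MatchesJoined(tj, "Confidential"));      // True
    }
}
```

Joining the fragments is enough to find the text, but writing a replacement back would still require deciding which fragments and kerning values to keep, which this sketch deliberately leaves out.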

A pretty basic example of some of the flexibility around iTextSharp is available from this GitHub repository: https://github.com/bevanweiss/PdfEditor (code excerpts are also included below).

NOTE: This utilises the AGPL version of iTextSharp (and is hence also AGPL), so if you will be distributing executables derived from this code or allowing others to interact with those executables in any way then you must also provide your modified source code. There is also no warranty, implied or expressed, related to this code. Use at your own peril.

PdfContentStreamEditor

using System.Collections.Generic;

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class PdfContentStreamEditor : PdfContentStreamProcessor
    {
        /**
         * This method edits the immediate contents of a page, i.e. its content stream.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public void EditPage(PdfStamper pdfStamper, int pageNum)
        {
            var pdfReader = pdfStamper.Reader;
            var page = pdfReader.GetPageN(pageNum);
            var pageContentInput = ContentByteUtils.GetContentBytesForPage(pdfReader, pageNum);
            page.Remove(PdfName.CONTENTS);
            EditContent(pageContentInput, page.GetAsDict(PdfName.RESOURCES), pdfStamper.GetUnderContent(pageNum));
        }

        /**
         * This method processes the content bytes and outputs to the given canvas.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public virtual void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            this.Canvas = canvas;
            ProcessContent(contentBytes, resources);
            this.Canvas = null;
        }

        /**
         * This method writes content stream operations to the target canvas. The default
         * implementation writes them as they come, so it essentially generates identical
         * copies of the original instructions the {@link ContentOperatorWrapper} instances
         * forward to it.
         *
         * Override this method to achieve some fancy editing effect.
         */

        protected virtual void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
        {
            var index = 0;

            foreach (var pdfObject in operands)
            {
                pdfObject.ToPdf(null, Canvas.InternalBuffer);
                Canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
            }
        }


        //
        // constructor giving the parent a dummy listener to talk to 
        //
        public PdfContentStreamEditor() : base(new DummyRenderListener())
        {
        }

        //
        // constructor passing a caller-supplied listener on to the parent
        //
        public PdfContentStreamEditor(IRenderListener renderListener) : base(renderListener)
        {
        }

        //
        // Overrides of PdfContentStreamProcessor methods
        //

        public override IContentOperator RegisterContentOperator(string operatorString, IContentOperator newOperator)
        {
            var wrapper = new ContentOperatorWrapper();
            wrapper.SetOriginalOperator(newOperator);
            var formerOperator = base.RegisterContentOperator(operatorString, wrapper);
            return (formerOperator is ContentOperatorWrapper operatorWrapper ? operatorWrapper.GetOriginalOperator() : formerOperator);
        }

        public override void ProcessContent(byte[] contentBytes, PdfDictionary resources)
        {
            this.Resources = resources; 
            base.ProcessContent(contentBytes, resources);
            this.Resources = null;
        }

        //
        // members holding the output canvas and the resources
        //
        protected PdfContentByte Canvas = null;
        protected PdfDictionary Resources = null;

        //
        // A content operator class to wrap all content operators to forward the invocation to the editor
        //
        class ContentOperatorWrapper : IContentOperator
        {
            public IContentOperator GetOriginalOperator()
            {
                return _originalOperator;
            }

            public void SetOriginalOperator(IContentOperator op)
            {
                this._originalOperator = op;
            }

            public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
            {
                if (_originalOperator != null && !"Do".Equals(oper.ToString()))
                {
                    _originalOperator.Invoke(processor, oper, operands);
                }
                ((PdfContentStreamEditor)processor).Write(processor, oper, operands);
            }

            private IContentOperator _originalOperator = null;
        }

        //
        // A dummy render listener to give to the underlying content stream processor to feed events to
        //
        class DummyRenderListener : IRenderListener
        {
            public void BeginTextBlock() { }

            public void RenderText(TextRenderInfo renderInfo) { }

            public void EndTextBlock() { }

            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
    }
}

TextReplaceStreamEditor

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextReplaceStreamEditor : PdfContentStreamEditor
    {
        public TextReplaceStreamEditor(string MatchPattern, string ReplacePattern)
        {
            _matchPattern = MatchPattern;
            _replacePattern = ReplacePattern;
        }

        private string _matchPattern;
        private string _replacePattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            var operatorString = oper.ToString();
            if ("Tj".Equals(operatorString) || "TJ".Equals(operatorString))
            {
                for(var i = 0; i < operands.Count; i++)
                {
                    if(!operands[i].IsString())
                        continue;

                    var text = operands[i].ToString();
                    if(Regex.IsMatch(text, _matchPattern))
                    {
                        operands[i] = new PdfString(Regex.Replace(text, _matchPattern, _replacePattern));
                    }
                }
            }

            base.Write(processor, oper, operands);
        }
    }
}
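
Because both constructor arguments are handed to `Regex`, literal search text containing characters such as `(`, `.` or `$` has to be escaped, and a literal replacement containing `$` needs its dollar signs doubled. A minimal sketch using only `System.Text.RegularExpressions` follows; the `LiteralPatterns` class is a hypothetical helper and is not part of the repository.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the repository) for passing literal text
// to an API that expects a regex pattern and a regex replacement string.
public static class LiteralPatterns
{
    // Escapes regex metacharacters so the text is matched literally.
    public static string Match(string literal)
    {
        return Regex.Escape(literal);
    }

    // Doubles '$' so it is not read as a group reference in the replacement.
    public static string Replacement(string literal)
    {
        return literal.Replace("$", "$$");
    }

    public static void Main()
    {
        var text = "Total (USD): $5.00";
        var result = Regex.Replace(text, Match("$5.00"), Replacement("$7.50"));
        Console.WriteLine(result); // Total (USD): $7.50
    }
}
```

With these helpers, a literal replacement could be set up as `new TextReplaceStreamEditor(LiteralPatterns.Match("..."), LiteralPatterns.Replacement("..."))`.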

TextRedactStreamEditor

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextRedactStreamEditor : PdfContentStreamEditor
    {
        public TextRedactStreamEditor(string MatchPattern) : base(new RedactRenderListener(MatchPattern))
        {
            _matchPattern = MatchPattern;
        }

        private string _matchPattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            base.Write(processor, oper, operands);
        }

        public override void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            ((RedactRenderListener)base.RenderListener).SetCanvas(canvas);
            base.EditContent(contentBytes, resources, canvas);
        }
    }

    //
    // A pretty simple render listener, all we care about is text stuff.
    // We listen out for text blocks, look for our text, and then put a
    // black box over it.. text 'redacted'
    //
    class RedactRenderListener : IRenderListener
    {
        private PdfContentByte _canvas;
        private string _matchPattern;

        public RedactRenderListener(string MatchPattern)
        {
            _matchPattern = MatchPattern;
        }

        public RedactRenderListener(PdfContentByte Canvas, string MatchPattern)
        {
            _canvas = Canvas;
            _matchPattern = MatchPattern;
        }

        public void SetCanvas(PdfContentByte Canvas)
        {
            _canvas = Canvas;
        }

        public void BeginTextBlock() { }

        public void RenderText(TextRenderInfo renderInfo)
        {
            var text = renderInfo.GetText();

            var match = Regex.Match(text, _matchPattern);
            if(match.Success && match.Length > 0)
            {
                // The last matched character sits at Index + Length - 1;
                // Index + Length is one past it and can run off the end of the list.
                var chars = renderInfo.GetCharacterRenderInfos();
                var first = chars[match.Index];
                var last = chars[match.Index + match.Length - 1];
                var p1 = first.GetAscentLine().GetStartPoint();
                var p2 = last.GetAscentLine().GetEndPoint();
                var p3 = last.GetDescentLine().GetEndPoint();
                var p4 = first.GetDescentLine().GetStartPoint();

                _canvas.SaveState();
                _canvas.SetColorStroke(BaseColor.BLACK);
                _canvas.SetColorFill(BaseColor.BLACK);
                _canvas.MoveTo(p1[Vector.I1], p1[Vector.I2]);
                _canvas.LineTo(p2[Vector.I1], p2[Vector.I2]);
                _canvas.LineTo(p3[Vector.I1], p3[Vector.I2]);
                _canvas.LineTo(p4[Vector.I1], p4[Vector.I2]);
                _canvas.ClosePathFillStroke();
                _canvas.RestoreState();
            }
        }

        public void EndTextBlock() { }

        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
}
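
The corners of the redaction box come from the first and last character of the match. Since `match.Index + match.Length` is one past the last matched character, the inclusive last index is `match.Index + match.Length - 1`; a match that ends at the end of the string would otherwise index out of range. A minimal sketch of that arithmetic, using only `System.Text.RegularExpressions` (the `MatchRange` class is a hypothetical helper, not part of the repository):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the index arithmetic only.
public static class MatchRange
{
    // Returns the inclusive (first, last) character indices of the first
    // match, or null when there is no match or the match is empty.
    public static (int First, int Last)? Find(string text, string pattern)
    {
        var m = Regex.Match(text, pattern);
        if (!m.Success || m.Length == 0)
            return null;
        return (m.Index, m.Index + m.Length - 1);
    }

    public static void Main()
    {
        // "Agent Smith" has 11 characters, so valid indices are 0..10.
        var range = Find("Agent Smith", "Smith");
        Console.WriteLine(range); // (6, 10)
    }
}
```

This assumes one entry in the character list per character of the matched text, which is the same assumption the listener makes.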

Using them with iTextSharp

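// Needs: using System.IO; using System.Text; using iTextSharp.text.pdf;
var reader = new PdfReader("SRC FILE PATH GOES HERE");
var dstFile = File.Open("DST FILE PATH GOES HERE", FileMode.Create);

// The stamper writes its output to the destination stream opened above.
var pdfStamper = new PdfStamper(reader, dstFile, reader.PdfVersion, false);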

// We don't need to auto-rotate, as the PdfContentStreamEditor will already deal with pre-rotated space..
// if we enable this we will inadvertently rotate the content.
pdfStamper.RotateContents = false;

// This is for the Text Replace
var replaceTextProcessor = new TextReplaceStreamEditor(
    "TEXT TO REPLACE HERE",
    "TEXT TO SUBSTITUTE IN HERE");

for(int i=1; i <= reader.NumberOfPages; i++)
    replaceTextProcessor.EditPage(pdfStamper, i);


// This is for the Text Redact
var redactTextProcessor = new TextRedactStreamEditor(
    "TEXT TO REDACT HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
    redactTextProcessor.EditPage(pdfStamper, i);
// Since our redacting just puts a box over the top, the original text is still in the file.
// We secure the document a bit to prevent people copying/pasting the text behind the box.
// This also prevents text-to-speech processing, which would otherwise speak the 'hidden' text.
pdfStamper.Writer.SetEncryption(null, 
    Encoding.UTF8.GetBytes("ownerPassword"),
    PdfWriter.AllowDegradedPrinting | PdfWriter.AllowPrinting,
    PdfWriter.ENCRYPTION_AES_256);

// hey, let's get rid of JavaScript too, because it's annoying
pdfStamper.Javascript = "";


// and then finally we close our files (saving it in the process) 
pdfStamper.Close();
reader.Close();
BevanWeiss
  • *"This is possible in a limited fashion with the use of iText / iTextSharp. It will only work with Tj/TJ opcodes (i.e. standard text, not text embedded in images, or drawn with shapes)."* - There are additional limitations: In particular the code assumes that the strings are encoded in some ASCII'ish fashion which need not be true, you'll actually find a lot of documents in the wild the fonts of which use some ad-hoc generated encoding. Furthermore, you assume fonts to be complete enough to allow replacement. But many documents nowadays only contain font subsets of the actually used glyphs. – mkl Mar 26 '18 at 12:48
  • Other restrictions are more obvious, e.g. the code assumes the whole match to be inside a single string. That being said, though, there still are many PDF generators creating simple PDFs within these limitations. Thus, if you're sure about your input files, you can indeed edit or remove text this way. – mkl Mar 26 '18 at 12:51
  • So how does this work in iText7? – test Feb 01 '23 at 18:28

You can use GroupDocs.Redaction (available for .NET) to replace or remove text in PDF documents. It supports exact-phrase, case-sensitive, and regular-expression redaction (removal) of text. The following code snippet replaces the word "candy" with "[redacted]" in the loaded PDF document.

C#:

using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
     doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
     // Save the document to "*_Redacted.*" file.
     doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false }); 
}

Disclosure: I work as Developer Evangelist at GroupDocs.

Usman Aziz
  • On https://products.groupdocs.com/ it [looks like redaction is only available for .Net](https://i.stack.imgur.com/AQ8cs.png). One finds the page for the Java product by obvious URL manipulation, though, but no download. – mkl May 23 '19 at 10:10
  • Also [no Redaction version](https://i.stack.imgur.com/jVqJP.png) on https://repository.groupdocs.com/repo/com/groupdocs/ – mkl May 23 '19 at 10:15
  • @mkl, Pardon me for the confusion caused. I have updated the answer and the Java version of GroupDocs.Redaction is coming soon. – Usman Aziz May 24 '19 at 05:15