I can extract text from pages in a PDF in many ways:

String pageText = PdfTextExtractor.GetTextFromPage(reader, i);

This can be used to get any text on a page.

Alternatively:

byte[] contentBytes = iTextSharp.text.pdf.parser.ContentByteUtils.GetContentBytesForPage(reader, i);

Possibilities are endless.

Now I want to remove/redact a certain word (e.g. explicit words or sensitive information) from the PDF, which is simple and text only. Putting black boxes over the words is obviously a bad idea :) I can find the word just fine using the approach above, and I can count its occurrences, etc.

I do not care about layout, or the fact that PDF is not really meant to be manipulated in this way.

I just wish to know if there is a mechanism that would allow me to manipulate the raw content of my PDF in this way. You could say I'm looking for "SetContentBytesForPage()" ...

Bruno Lowagie
Kris

3 Answers

If you want to change the content of a page, it isn't sufficient to change the content stream of a page. A page may contain references to Form XObjects that contain content that you want to remove.

A secondary problem consists of images. For instance: suppose that your document consists of a scanned document that has been OCR'ed. In that case, it isn't sufficient to remove the (vector) text, you'll also need to manipulate the (pixel) text in the image.

Assuming that your secondary problem doesn't exist, you'll need a double approach:

  1. get the content from the page as text to detect in which pages there are names or words you want to remove.
  2. recursively loop over all the content streams to find that text and to rewrite those content streams without that text.

From your question, I assume that you have already solved problem 1. Solving problem 2 isn't that trivial. In chapter 15 of my book, I have an example where extracting text returns "Hello World", but when you look inside the content stream, you see:

BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET

Before you can remove "Hello World" from this stream snippet, you'll need some heuristics so that your program recognizes the text in this syntax.
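As a rough illustration of such a heuristic, here is a minimal sketch in plain Java (no iText involved). It assumes the simplest possible case: only Td and Tj operators, all fragments on the same line, and literal strings without escapes. The class and method names are made up for this example. It tracks the x position of each line start (every Td is relative to the previous one) and sorts the fragments from left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: orders Tj fragments by x position.
class FragmentOrder {

    /** One shown string plus the x position at which it starts. */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    // Matches either "tx ty Td" or "(literal) Tj".
    // Assumption: literals contain no parentheses or backslash escapes.
    private static final Pattern OPS = Pattern.compile(
            "(-?\\d*\\.?\\d+)\\s+(-?\\d*\\.?\\d+)\\s+Td|\\(([^()\\\\]*)\\)\\s*Tj");

    /** Returns the fragments of a content stream snippet, sorted left to right. */
    public static List<Fragment> fragments(String content) {
        List<Fragment> result = new ArrayList<>();
        double x = 0;
        Matcher m = OPS.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                // Td moves relative to the start of the previous line
                x += Double.parseDouble(m.group(1));
            } else {
                result.add(new Fragment(x, m.group(3)));
            }
        }
        result.sort(Comparator.comparingDouble(f -> f.x));
        return result;
    }

    /** Concatenates the sorted fragments; no spacing is inferred. */
    public static String joinByX(String content) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments(content)) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String snippet = "BT\n/F1 12 Tf\n88.66 367 Td\n(ld) Tj\n-22 0 Td\n(Wor) Tj\n"
                + "-15.33 0 Td\n(llo) Tj\n-15.33 0 Td\n(He) Tj\nET";
        System.out.println(joinByX(snippet)); // prints HelloWorld
    }
}
```

Note that this yields "HelloWorld" without the space: the sketch has no font metrics, so it cannot tell that the gap between "llo" and "Wor" is wide enough to count as a space. A real text extractor compares such gaps against glyph widths, and it also has to deal with the y coordinate, Tm, TJ arrays and font encodings, all of which are ignored here.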

Once you've found the text, you need to rewrite the stream. For inspiration, you can take a look at the OCG remover functionality in the itext-xtra package.

Long story short: if your PDFs are relatively simple, that is, if the text can be easily detected in the different content streams (page content and Form XObject content), then it's simply a matter of rewriting those streams after some string manipulations.
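The string manipulation itself could look like the following minimal sketch in plain Java. It assumes the word is stored as one complete literal string operand, such as (STRINGTOREMOVE) Tj, in a single-byte encoding; hex strings, TJ arrays and custom font encodings are not covered. The class and method names are hypothetical. The bytes are decoded as ISO-8859-1 because that charset maps every byte to exactly one character, so everything that is not replaced survives the round trip unchanged.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iText: edits literal strings in raw content bytes.
class ContentStreamEdit {

    /** Escapes the characters that are special inside a PDF literal string. */
    public static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    /**
     * Replaces every literal string operand that equals target with replacement.
     * Assumption: the target appears as a whole "(...)" operand, not split up.
     */
    public static byte[] replaceLiteral(byte[] data, String target, String replacement) {
        // ISO-8859-1 is a lossless byte <-> char mapping, unlike ASCII or UTF-8
        String content = new String(data, StandardCharsets.ISO_8859_1);
        String edited = content.replace(
                "(" + escapeLiteral(target) + ")",
                "(" + escapeLiteral(replacement) + ")");
        return edited.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] in = "BT (STRINGTOREMOVE) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = replaceLiteral(in, "STRINGTOREMOVE", "");
        System.out.println(new String(out, StandardCharsets.ISO_8859_1)); // prints BT () Tj ET
    }
}
```

Replacing with an empty string leaves an empty operand, () Tj, behind. That is still valid syntax and shows nothing, which avoids having to remove the operator as well.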

I've made you a simple example named ReplaceStream that replaces "Hello World" with "HELLO WORLD" in a PDF.

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream)object;
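        byte[] data = PdfReader.getStreamBytes(stream);
        // ISO-8859-1 maps each byte to one char, so bytes outside the replaced text are preserved
        // (needs: import java.nio.charset.StandardCharsets;)
        String content = new String(data, StandardCharsets.ISO_8859_1);
        stream.setData(content.replace("Hello World", "HELLO WORLD").getBytes(StandardCharsets.ISO_8859_1));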
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}

Some caveats:

  • I check if object is a stream. It could also be an array of streams. In that case, you need to loop over that array.
  • I don't check if there are form XObjects defined for the page.
  • I assume that Hello World can be easily detected in the PDF Syntax.
  • ...

In real life, PDFs are never that simple and the complexity of your project will increase dramatically with every special feature that is used in your documents.

Bruno Lowagie
  • Hi Bruno, thank you for this update. I did a lot of research already (frequently bumping into posts by you btw :) and I do realize how complex PDFs can be. Luckily for me, most of the occurrences of offending strings in my PDFs look like this: ... (STRINGTOREMOVE) Tj ... I checked this by looking at the raw objects in PDFVole. That is why I figured it might be relatively simple to get rid of the string entirely. However, the "proper way" to get "write" access to the content stream in the iTextSharp API eludes me. I can read everything, but I can't work out how to modify it and write it back... – Kris Feb 07 '14 at 12:21
  • OK, give me a moment. I'll make you an example (in Java, but it will give you a place to start). – Bruno Lowagie Feb 07 '14 at 12:22
  • I've adapted my answer. Note that you can also look inside a PDF using iText RUPS: http://itextpdf.com/product/itext_rups – Bruno Lowagie Feb 07 '14 at 12:43
  • Nice! I somehow completely overlooked the fact that you can call SetData() on PRStream... My bad... I'll add the C# equivalent below. – Kris Feb 07 '14 at 17:03
  • Thanks, I upvoted it. I'll use it for further reference. – Bruno Lowagie Feb 07 '14 at 17:38
  • One more minor snag I ran into: as you already suspected, the PdfObject returned by dict.getDirectObject(PdfName.CONTENTS) is indeed an array (IsArray() returns true). Stepping through my code I can see that these arrays contain indirect reference objects (e.g. 1422 0 R). Using PDFVole I can see that all the referenced objects in the array are indeed in my PDF and some are indeed streams I am interested in manipulating. I'm not quite sure how I can actually "Follow" an Indirect Reference to get the actual Stream object but I'll try to figure it out and update the info below accordingly... – Kris Feb 07 '14 at 18:07
  • *I'm not quite sure how I can actually "Follow" an Indirect Reference to get the actual Stream object* - use the static helper method `PdfReader.getPdfObject`. – mkl Feb 07 '14 at 19:55
  • Hello, I've started writing "The ABC of PDF", which explains how to get specific objects, even if they are indirect references. You can find this book here for free: https://leanpub.com/itext_pdfabc/ It isn't finished yet, but you have to register to get it and you'll get a mail when it's updated. – Bruno Lowagie Feb 08 '14 at 11:29
  • @BrunoLowagie, how can I replace another string when I don't know its representation? – Mihai Alexandru-Ionut May 25 '17 at 13:38

The C# equivalent of the code by Bruno:

static void manipulatePdf(String src, String dest)
{
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.GetPageN(1);
    PdfObject pdfObject = dict.GetDirectObject(PdfName.CONTENTS);
    if (pdfObject.IsStream())
    {
        PRStream stream = (PRStream)pdfObject;
        byte[] data = PdfReader.GetStreamBytes(stream);
        // The ASCII round trip is only safe if the stream holds 7-bit bytes;
        // any byte above 0x7F would be turned into '?'.
        stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace("Hello World", "HELLO WORLD")));
    }
    FileStream outStream = new FileStream(dest, FileMode.Create);
    PdfStamper stamper = new PdfStamper(reader, outStream);
    // Closing the stamper is what writes the modified document to outStream.
    stamper.Close();
    reader.Close();
}

I'll update this if it turns out to still contain errors.

Kris

In follow-up to my previous C# code and the remark by Bruno that GetDirectObject(PdfName.CONTENTS) might as well return an array as opposed to a stream: In my particular case, this turned out to be true.

The returned PdfObject reported "true" for IsArray(). I checked, and the array elements were all PdfIndirectReference objects.

A further look at the API yielded two useful bits of info:

  1. PdfIndirectReference has a "Number" property, which identifies the referenced PdfObject.
  2. You can get the referenced object using reader.GetPdfObject(int idx), where idx is the "Number" property of the PdfIndirectReference.

From there on out, you get a new PdfObject that you can check using IsStream() and modify as per the previously posted code.

So it works out to this (mind you, this is quick and dirty, but it works for my particular purposes...):

// Get the contents of my page...
PdfObject pdfObject = pageDict.GetDirectObject(PdfName.CONTENTS);

// Check whether this is, in fact, an array or something else...
if (pdfObject.IsArray())
{
    PdfArray streamArray = pageDict.GetAsArray(PdfName.CONTENTS);

    for (int j = 0; j < streamArray.Size; j++)
    {
        PdfIndirectReference arrayEl = (PdfIndirectReference)streamArray[j];

        // Follow the indirect reference to the actual object
        PdfObject refdObj = reader.GetPdfObject(arrayEl.Number);

        if (refdObj.IsStream())
        {
            PRStream stream = (PRStream)refdObj;
            byte[] data = PdfReader.GetStreamBytes(stream);
            stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace(targetedText, newText)));
        }
    }
}
Kris