I have a PDF from which I want to remove all images and other drawing content, and save the result as a new PDF.

I know how to remove text by matching the TJ and Tj operators, which I currently do as below:

op.getOperation().equals( "TJ")

Instead of removing the TJ and Tj operators, is it possible to copy these text operators into another PDF file with formatting intact, so that the new PDF is a pure text-only PDF? It's OK if text drawn with operators other than Tj and TJ is lost.

The code to remove TJ and Tj is taken from THIS Stack Overflow post. But it only partially works: it removes just the images, leaving drawings and other art intact.

EDIT: Another option I can think of is to set the CMYK color used by all operators outside the BT ... ET blocks to white, so that the PDF would appear text-only. Is this possible? If yes, please support with code examples in PDFBox.

hussainb
  • How complex are the page contents? I.e., are there form xobjects, which can in turn contain mixed text and image data and therefore also have to be treated? – mkl Mar 01 '14 at 08:28

1 Answer

... THIS stackoverflow post. But it partially works , it just removes images only, leaving drawing and other art intact.

The main source of graphics other than bitmap graphics is vector graphics. They usually consist of path definitions followed by commands filling or stroking the path.

To remove these graphics, you can improve the sample from the answer you referred to by additionally replacing the path stroking or filling operators with the n operator, which ends a path without painting it (a path-painting no-op).

    if( token instanceof PDFOperator )
    {
        PDFOperator op = (PDFOperator)token;
        if( op.getOperation().equals( "Do" ) )
        {
            // remove the one argument to this operator (the xobject name)
            // and skip the operator itself
            newTokens.remove( newTokens.size() - 1 );
            continue;
        }
        else if( PAINTING_PATH_OPS.contains( op.getOperation() ) )
        {
            // replace path painting operator by path no-op
            token = PDFOperator.getOperator( "n" );
        }
    }

where

static final List<String> PAINTING_PATH_OPS = Arrays.asList("S", "s", "F", "f", "f*", "B", "b", "B*", "b*");

contains the path stroking or filling operators.
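To illustrate why the operator is replaced rather than deleted, here is a minimal sketch that models a content stream as a flat list of string tokens. This is a simplification for illustration only: the class and method names (PathPaintingFilter, neutralizePaths) are hypothetical and it does not use the PDFBox token types. A path built by operators such as re still has to be ended by some operator, and n ends it without painting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (assumption): a content stream as plain string tokens, not PDFBox objects. */
final class PathPaintingFilter {
    // Path stroking and filling operators, same list as in the answer.
    static final List<String> PAINTING_PATH_OPS =
            Arrays.asList("S", "s", "F", "f", "f*", "B", "b", "B*", "b*");

    /** Returns a copy of the tokens with every path-painting operator replaced by "n". */
    static List<String> neutralizePaths(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            // The path construction operands and operators stay in place;
            // only the operator that would paint the path is swapped out.
            out.add(PAINTING_PATH_OPS.contains(token) ? "n" : token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
                "0", "0", "100", "50", "re", "f",                      // filled rectangle
                "BT", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");       // text block
        // Prints: 0 0 100 50 re n BT /F1 12 Tf (Hello) Tj ET
        System.out.println(String.join(" ", neutralizePaths(stream)));
    }
}
```

The rectangle is still constructed but never painted, and the text block passes through unchanged.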

PS: The image removal code used in that referred-to answer has two drawbacks:

  • It removes too much because it not only removes image xobjects but also form xobjects; sometimes (especially in n-up tool outputs) all content resides inside such form xobjects, including all text.

    To fix this you have to check the type of the referred-to xobject and only remove it if it has sub-type image. As form xobjects in turn can also contain images, you have to recurse into the form xobject (which has a content stream of its own).

  • It removes too little because it ignores inlined images.

    To fix this you also have to look out for BI <key-value pairs> ID <image data> EI sections in the content and remove them.
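Both fixes can be sketched on a simplified model. The following is an illustration under stated assumptions, not PDFBox code: tokens are plain strings, the xobject sub-types are supplied in a hypothetical map from resource name to "Image" or "Form", and the names ImageOnlyFilter and removeImages are made up for this example. A real parser may present inline images differently, and real image data is binary, so this only shows the control flow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (assumption): string tokens plus a name-to-sub-type map for xobjects. */
final class ImageOnlyFilter {
    static List<String> removeImages(List<String> tokens, Map<String, String> xObjectSubtypes) {
        List<String> out = new ArrayList<>();
        boolean inInlineImage = false;
        for (String token : tokens) {
            if (inInlineImage) {
                // Drop everything up to and including the closing EI.
                if (token.equals("EI")) {
                    inInlineImage = false;
                }
                continue;
            }
            if (token.equals("BI")) {
                inInlineImage = true;   // start of an inline image section
                continue;
            }
            if (token.equals("Do") && !out.isEmpty()) {
                String name = out.get(out.size() - 1);
                if ("Image".equals(xObjectSubtypes.get(name))) {
                    out.remove(out.size() - 1);   // drop the name operand and the Do
                    continue;
                }
                // Form xobjects are kept here; real code would also have to
                // filter the form's own content stream recursively.
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> subtypes = new HashMap<>();
        subtypes.put("/Im0", "Image");
        subtypes.put("/Fm0", "Form");
        List<String> stream = Arrays.asList(
                "q", "/Im0", "Do", "Q", "/Fm0", "Do",
                "BI", "/W", "1", "/H", "1", "ID", "<data>", "EI",
                "BT", "(Hi)", "Tj", "ET");
        // Prints: q Q /Fm0 Do BT (Hi) Tj ET
        System.out.println(String.join(" ", removeImages(stream, subtypes)));
    }
}
```

The form xobject invocation survives, so text inside it is not lost, while both the image xobject and the inline image are dropped.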

mkl
  • Thank you for such a detailed answer. I have read the pdfReference.pdf many times but managed to completely overlook the "n" operator. I always wondered why removing the path operators left the PDF corrupt. As for the form xobjects you mentioned, my PDF doesn't contain any, but your note about them will come in handy later. – hussainb Mar 02 '14 at 07:32
  • mkl, it is really amazing to learn that you have been part of Stack Overflow for more than a year without asking a single question, yet have answered countless times with superior answers. I really appreciate your work. – hussainb Mar 02 '14 at 07:37
  • Well, I asked a question on Information Security ;) Anyway, I learned quite a lot by answering questions: I often had an idea how to solve the issue, but working out the details gave interesting insights. – mkl Mar 02 '14 at 10:27