I need to analyze path data of PDF files and manipulate content with iText 7. Manipulations include deletion/replacement and coloring.

I can analyze the graphics alright with something like the following code:

public class ContentParsing {
    public static void main(String[] args) throws IOException {
        new ContentParsing().inspectPdf("testdata/test.pdf");
    }

    public void inspectPdf(String path) throws IOException {
        File file = new File(path);
        PdfDocument pdf = new PdfDocument(new PdfReader(file.getAbsolutePath()));
        PdfDocumentContentParser parser = new PdfDocumentContentParser(pdf);
        for (int i=1; i<=pdf.getNumberOfPages(); i++) {
            parser.processContent(i, new PathEventListener());
        }
        pdf.close();
    }
}


public class PathEventListener implements IEventListener {
    public void eventOccurred(IEventData eventData, EventType eventType) {
        PathRenderInfo pathRenderInfo = (PathRenderInfo) eventData;
        for ( Subpath subpath : pathRenderInfo.getPath().getSubpaths() ) {
            for ( IShape segment : subpath.getSegments() ) {
                // Here goes some path analysis code
                System.out.println(segment.getBasePoints());
            }
        }
    }

    public Set<EventType> getSupportedEvents() {
        Set<EventType> supportedEvents = new HashSet<EventType>();
        supportedEvents.add(EventType.RENDER_PATH);
        return supportedEvents;
    }
}

Now, what's the way to go with manipulating things and writing them back to the PDF? Do I have to construct an entirely new PDF document and copy everything over (in manipulated form), or can I somehow manipulate the read PDF data directly?

Thomas W
  • Creating a new pdf and adding the modified content is probably the best way to go and puts you in complete control. Modifying an existing pdf is technically possible, and some tasks like adding content over/under existing content or highlighting using a different colour are quite easy using iText. Others, especially things like text-replacement or search contain a good number of pitfalls and are technically hard. I'd recommend to have a look at http://developers.itextpdf.com/ and browse through some examples and tutorials to see what's possible. – Samuel Huylebroeck Dec 05 '16 at 12:35

1 Answer

Now, what's the way to go with manipulating things and writing them back to the PDF? Do I have to construct an entirely new PDF document and copy everything over (in manipulated form), or can I somehow manipulate the read PDF data directly?

In essence, you are looking for a class that does not merely parse a PDF content stream and signal the instructions in it, as the PdfCanvasProcessor does (the PdfDocumentContentParser you use is just a very thin wrapper around PdfCanvasProcessor), but that also creates the content stream anew from the instructions you forward back to it.
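To make the parse-and-rewrite idea concrete without any iText dependency, here is a minimal plain-Java sketch. It is not iText code: the class `ContentStreamSketch` and its methods are hypothetical, and the tokenizer assumes a heavily simplified content stream in which operands are only numbers or names. Real content streams also contain strings, arrays, dictionaries and inline images. The sketch collects operands until an operator arrives, then writes the instruction back, replacing every path-painting operator with `n` (end the path without painting it), which is one way to "delete" a path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified illustration; not part of iText.
class ContentStreamSketch {
    // Path-painting operators defined by the PDF specification.
    static final Set<String> PATH_PAINTING = new HashSet<>(
            Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*"));

    // Assumption: an operand is a number or a name starting with '/'.
    static boolean isOperand(String token) {
        return token.startsWith("/") || token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    // Default behaviour: write the operands, then the operator, one instruction per line.
    static void write(StringBuilder out, String operator, List<String> operands) {
        for (String operand : operands) {
            out.append(operand).append(' ');
        }
        out.append(operator).append('\n');
    }

    // Parses the stream and writes every instruction back,
    // replacing path-painting operators with "n".
    static String dropPathPainting(String content) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            if (isOperand(token)) {
                operands.add(token);
                continue;
            }
            String operator = PATH_PAINTING.contains(token) ? "n" : token;
            write(out, operator, operands);
            operands.clear();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dropPathPainting("q 1 0 0 RG 10 10 m 50 50 l S Q"));
    }
}
```

Running `main` writes the instructions back one per line, with the stroke operator `S` replaced by `n`. Unlike this sketch, the iText-based editor below receives the operator as the last element of the operands list, so its `write` method only iterates over the operands.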

A generic content stream editor class

For iText 5.5.x a proof-of-concept for such a content stream editor class can be found in this answer (the Java version is a bit further down in the answer text).

This is a port of that proof-of-concept to iText 7:

public class PdfCanvasEditor extends PdfCanvasProcessor
{
    /**
     * This method edits the immediate contents of a page, i.e. its content stream.
     * It explicitly does not descend into form xobjects, patterns, or annotations.
     */
    public void editPage(PdfDocument pdfDocument, int pageNumber) throws IOException
    {
        if ((pdfDocument.getReader() == null) || (pdfDocument.getWriter() == null))
        {
            throw new PdfException("PdfDocument must be opened in stamping mode.");
        }

        PdfPage page = pdfDocument.getPage(pageNumber);
        PdfResources pdfResources = page.getResources();
        PdfCanvas pdfCanvas = new PdfCanvas(new PdfStream(), pdfResources, pdfDocument);
        editContent(page.getContentBytes(), pdfResources, pdfCanvas);
        page.put(PdfName.Contents, pdfCanvas.getContentStream());
    }

    /**
     * This method processes the content bytes and outputs to the given canvas.
     * It explicitly does not descend into form xobjects, patterns, or annotations.
     */
    public void editContent(byte[] contentBytes, PdfResources resources, PdfCanvas canvas)
    {
        this.canvas = canvas;
        processContent(contentBytes, resources);
        this.canvas = null;
    }

    /**
     * <p>
     * This method writes content stream operations to the target canvas. The default
     * implementation writes them as they come, so it essentially generates identical
     * copies of the original instructions the {@link ContentOperatorWrapper} instances
     * forward to it.
     * </p>
     * <p>
     * Override this method to achieve some fancy editing effect.
     * </p> 
     */
    protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
    {
        PdfOutputStream pdfOutputStream = canvas.getContentStream().getOutputStream();
        int index = 0;

        for (PdfObject object : operands)
        {
            pdfOutputStream.write(object);
            if (operands.size() > ++index)
                pdfOutputStream.writeSpace();
            else
                pdfOutputStream.writeNewLine();
        }
    }

    //
    // constructor giving the parent a dummy listener to talk to 
    //
    public PdfCanvasEditor()
    {
        super(new DummyEventListener());
    }

    //
    // Overrides of PdfCanvasProcessor methods
    //
    @Override
    public IContentOperator registerContentOperator(String operatorString, IContentOperator operator)
    {
        ContentOperatorWrapper wrapper = new ContentOperatorWrapper();
        wrapper.setOriginalOperator(operator);
        IContentOperator formerOperator = super.registerContentOperator(operatorString, wrapper);
        return formerOperator instanceof ContentOperatorWrapper ? ((ContentOperatorWrapper)formerOperator).getOriginalOperator() : formerOperator;
    }

    //
    // member holding the output canvas
    //
    protected PdfCanvas canvas = null;

    //
    // A content operator class to wrap all content operators to forward the invocation to the editor
    //
    class ContentOperatorWrapper implements IContentOperator
    {
        public IContentOperator getOriginalOperator()
        {
            return originalOperator;
        }

        public void setOriginalOperator(IContentOperator originalOperator)
        {
            this.originalOperator = originalOperator;
        }

        @Override
        public void invoke(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
        {
            if (originalOperator != null && !"Do".equals(operator.toString()))
            {
                originalOperator.invoke(processor, operator, operands);
            }
            write(processor, operator, operands);
        }

        private IContentOperator originalOperator = null;
    }

    //
    // A dummy event listener to give to the underlying canvas processor to feed events to
    //
    static class DummyEventListener implements IEventListener
    {
        @Override
        public void eventOccurred(IEventData data, EventType type)
        { }

        @Override
        public Set<EventType> getSupportedEvents()
        {
            return null;
        }
    }
}

(PdfCanvasEditor.java)

The explanations from the iText 5 answer still apply; the parsing framework has not changed much from iText 5.5.x to iText 7.0.x.

Usage examples

Unfortunately you wrote in very vague terms about how exactly you want to change the contents. Thus I simply ported some iText 5 samples which made use of the original iText 5 content stream editor class:

Watermark removal

These are ports of the use cases in this answer.

testRemoveBoldMTTextDocument

This example drops all text written in a font the name of which ends with "BoldMT":

try (   InputStream resource = getClass().getResourceAsStream("document.pdf");
        PdfReader pdfReader = new PdfReader(resource);
        OutputStream result = new FileOutputStream(new File(RESULT_FOLDER, "document-noBoldMTText.pdf"));
        PdfWriter pdfWriter = new PdfWriter(result);
        PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
    PdfCanvasEditor editor = new PdfCanvasEditor()
    {

        @Override
        protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
        {
            String operatorString = operator.toString();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                if (getGraphicsState().getFont().getFontProgram().getFontNames().getFontName().endsWith("BoldMT"))
                    return;
            }
            
            super.write(processor, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
    {
        editor.editPage(pdfDocument, i);
    }
}

(EditPageContent.java test method testRemoveBoldMTTextDocument)

testRemoveBigTextDocument

This example drops all text written with a large font size:

try (   InputStream resource = getClass().getResourceAsStream("document.pdf");
        PdfReader pdfReader = new PdfReader(resource);
        OutputStream result = new FileOutputStream(new File(RESULT_FOLDER, "document-noBigText.pdf"));
        PdfWriter pdfWriter = new PdfWriter(result);
        PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
    PdfCanvasEditor editor = new PdfCanvasEditor()
    {

        @Override
        protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
        {
            String operatorString = operator.toString();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                if (getGraphicsState().getFontSize() > 100)
                    return;
            }
            
            super.write(processor, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
    {
        editor.editPage(pdfDocument, i);
    }
}

(EditPageContent.java test method testRemoveBigTextDocument)

Text color change

This is a port of the use case in this answer.

testChangeBlackTextToGreenDocument

This example changes the color of black text to green.

try (   InputStream resource = getClass().getResourceAsStream("document.pdf");
        PdfReader pdfReader = new PdfReader(resource);
        OutputStream result = new FileOutputStream(new File(RESULT_FOLDER, "document-blackTextToGreen.pdf"));
        PdfWriter pdfWriter = new PdfWriter(result);
        PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
    PdfCanvasEditor editor = new PdfCanvasEditor()
    {

        @Override
        protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
        {
            String operatorString = operator.toString();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                if (currentlyReplacedBlack == null)
                {
                    Color currentFillColor = getGraphicsState().getFillColor();
                    if (Color.BLACK.equals(currentFillColor))
                    {
                        currentlyReplacedBlack = currentFillColor;
                        super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(1), new PdfNumber(0), new PdfLiteral("rg")));
                    }
                }
            }
            else if (currentlyReplacedBlack != null)
            {
                if (currentlyReplacedBlack instanceof DeviceCmyk)
                {
                    super.write(processor, new PdfLiteral("k"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfNumber(1), new PdfLiteral("k")));
                }
                else if (currentlyReplacedBlack instanceof DeviceGray)
                {
                    super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
                }
                else
                {
                    super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfLiteral("rg")));
                }
                currentlyReplacedBlack = null;
            }

            super.write(processor, operator, operands);
        }

        Color currentlyReplacedBlack = null;

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
    {
        editor.editPage(pdfDocument, i);
    }
}

(EditPageContent.java test method testChangeBlackTextToGreenDocument)

mkl
  • Wow, a very thorough answer! I'll have to take some time to go through it. Many thanks! – Thomas W Dec 06 '16 at 15:43
  • Just to make the usage examples complete - it seems there should be `pdfDocument.close()` after the `for` loops in the three usage examples, right? At least I will only get an empty file if I don't add it. Or is this an issue only with Java 1.8? – Thomas W Dec 19 '16 at 07:03
    @Thomas the `PdfDocument pdfDocument` instances in the examples are automatically closed as they are defined accordingly `try ( HERE ) {...}`. – mkl Dec 19 '16 at 08:19
  • Ah, I see. Eclipse complained about the try-with-resources so I removed it to get a basic version to tinker with. Didn't do much Java work before - I recognize in Eclipse I have to right click the project, choose Properties/Java Compiler and set "Compiler compliance settings" to "1.7". – Thomas W Dec 19 '16 at 09:16
  • How is your PdfCanvasEditor class licensed? I don't see a license on the [GitHub repo](https://github.com/mkl-public/testarea-itext7/blob/master/src/main/java/mkl/testarea/itext7/content/PdfCanvasEditor.java) where you published it. – Thomas W Nov 27 '20 at 08:22
  • Essentially I published it here on Stack Overflow; the version on GitHub is merely a completed copy including imports etc. Thus, Stack Overflow-derived licensing applies. – mkl Nov 27 '20 at 08:46