I'd like to filter RENDER_TEXT events as they are written to an output file. I have a PDF that has some text in it that I want filtered out. I've found that I can walk the document once and determine the characteristics of the render events that I want to filter. Now I'd like to copy the pages of the source document and skip over some RENDER_TEXT events so that the text does not appear in the destination document. I have an IEventFilter that will accept the correct events. I just need to know how to put this filter on the document writer.

The goal is to take a PDF created from Google Calendar in the Agenda format and remove the lines "Created by:" and "Calendar:". These lines are typically made up of 3 RENDER_TEXT events.

My current code is below. I have found that the events I want to remove can be identified by their baseline: all RENDER_TEXT events that share a baseline y-coordinate with a "Created by:" or "Calendar:" event belong to a line I want to drop.

import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.itextpdf.kernel.geom.LineSegment;
import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.IEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;

public class Main {

    private static final Logger LOGGER = LogManager.getLogger();

    public static void main(String[] args) throws FileNotFoundException, IOException {
        final Path src = Paths.get("calendar_2018-08-04_2018-08-19.pdf");
        final Path dest = Paths.get("/home/jpschewe/Downloads/calendar_clean.pdf");

        final Main app = new Main(src, dest);

    }

    private Main(final Path src, final Path dest) throws FileNotFoundException, IOException {

        try (PdfDocument srcDoc = new PdfDocument(new PdfReader(src.toFile()));
                PdfDocument destDoc = new PdfDocument(new PdfWriter(dest.toFile()))) {
            final Rectangle pageSize = srcDoc.getFirstPage().getPageSize();

            for (int i = 1; i <= srcDoc.getNumberOfPages(); ++i) {
                PdfPage page = srcDoc.getPage(i);

                final GatherBaselines gatherBaselines = new GatherBaselines();
                final PdfCanvasProcessor processor = new PdfCanvasProcessor(gatherBaselines);
                processor.processPageContent(page);

                LOGGER.info("Filter baselines for page {} -> {}", i, gatherBaselines.baselinesToFilter);

                destDoc.setDefaultPageSize(new PageSize(pageSize));
                destDoc.addNewPage();
            }

        }
    }

    public class FilterEventsByBaseline implements IEventFilter {
        private final List<Float> baselinesToFilter;

        public FilterEventsByBaseline(final List<Float> baselinesToFilter) {
            this.baselinesToFilter = baselinesToFilter;
        }

        @Override
        public boolean accept(final IEventData data, final EventType type) {
            if (type.equals(EventType.RENDER_TEXT)) {
                final TextRenderInfo renderInfo = (TextRenderInfo) data;
                final LineSegment baseline = renderInfo.getBaseline();
                final float checkY = baseline.getStartPoint().get(1);

                final boolean filter = baselinesToFilter.stream().anyMatch(f -> Math.abs(checkY - f) < 1E-6);
                return !filter;
            }

            return true;

        }
    }

    public class GatherBaselines implements IEventListener {

        // need to store all baselines that are problems
        // the assumption is that all RENDER_TEXT operations with a baseline in the bad
        // list need to be filtered when copying pages
        private final List<Float> baselinesToFilter = new LinkedList<>();

        @Override
        public void eventOccurred(final IEventData data, final EventType type) {
            if (type.equals(EventType.RENDER_TEXT)) {
                final TextRenderInfo renderInfo = (TextRenderInfo) data;

                final String text = renderInfo.getText();
                final LineSegment baseline = renderInfo.getBaseline();
                if (null != text && (text.contains("Calendar:") || text.contains("Created by:"))) {
                    // index 1 is the y coordinate
                    baselinesToFilter.add(baseline.getStartPoint().get(1));
                }
            }

        }

        @Override
        public Set<EventType> getSupportedEvents() {
            return Collections.singleton(EventType.RENDER_TEXT);
        }

    }

}

Thank you

Jon
  • `RENDER_TEXT` events are emitted during document content parsing, while writing is done via `PdfCanvas`. Those are different things. If you want to remove some content take a look at the [pdfSweep](https://itextpdf.com/itext7/pdfSweep) add-on. It supports many use cases but of course it depends on what exactly you want to remove. – Alexey Subach Aug 05 '18 at 09:13
  • In any case, please show us what you have tried and state the goal more clearly (what exactly you want to filter out and how you decide whether to filter it out or not). Until then I vote to close this question. – Alexey Subach Aug 05 '18 at 09:14
  • If you take all the information from the parsing events, you can reconstruct quite a lot of content of documents. But that's not what the parsing events originally were designed for, so some details probably won't be easy to reproduce. – mkl Aug 05 '18 at 10:22
  • The `PdfCanvasEditor` from [this answer](https://stackoverflow.com/a/40999180/1729265) might also be a base for implementing your task. – mkl Aug 05 '18 at 10:44
  • I've added sample code and more detail to my question. – Jon Aug 05 '18 at 13:01

1 Answer

As proposed in a comment, you can use the PdfCanvasEditor from [this answer](https://stackoverflow.com/a/40999180/1729265) to filter the operations in the content streams as desired. I slightly extended that class so that it properly supports the ' and " text drawing operators. You can find that class here.

Just like in your approach the lines to clear are determined in a first run: I used a RegexBasedLocationExtractionStrategy instance for this.

Thereafter, in the PdfCanvasEditor step, instructions drawing text on those lines are changed to only draw empty strings.

However, the text here is drawn not by the events you inspected but by more basic operator and operand structures, so the exact mechanics are not derived from an IEventFilter. They are similar to your approach, though.

try (PdfDocument pdfDocument = new PdfDocument(SOURCE_PDF_READER, TARGET_PDF_WRITER)) {
    List<Rectangle> triggerRectangles = new ArrayList<>();

    PdfCanvasEditor editor = new PdfCanvasEditor()
    {
        {
            Field field = PdfCanvasProcessor.class.getDeclaredField("textMatrix");
            field.setAccessible(true);
            textMatrixField = field;
        }

        @Override
        protected void nextOperation(PdfLiteral operator, List<PdfObject> operands) {
            try {
                recentTextMatrix = (Matrix)textMatrixField.get(this);
            } catch (IllegalArgumentException | IllegalAccessException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
        {
            String operatorString = operator.toString();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                Matrix matrix = null;
                try {
                    matrix = recentTextMatrix.multiply(getGraphicsState().getCtm());
                } catch (IllegalArgumentException e) {
                    throw new RuntimeException(e);
                }
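                // I32 holds the vertical translation, i.e. the y position of the text origin
                float y = matrix.get(Matrix.I32);
                if (triggerRectangles.stream().anyMatch(rect -> rect.getBottom() <= y && y <= rect.getTop())) {
                    // blank the text: an empty array for TJ, an empty string otherwise
                    // (the operator is the last operand, the string the one before it)
                    if ("TJ".equals(operatorString))
                        operands.set(0, new PdfArray());
                    else
                        operands.set(operands.size() - 2, new PdfString(""));
                }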
            }

            super.write(processor, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
        final Field textMatrixField;
        Matrix recentTextMatrix;
    };

    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
    {
        PdfPage page = pdfDocument.getPage(i);
        Set<PdfName> xobjectNames = page.getResources().getResourceNames(PdfName.XObject);
        for (PdfName xobjectName : xobjectNames) {
            PdfFormXObject xobject = page.getResources().getForm(xobjectName);
            byte[] content = xobject.getPdfObject().getBytes();
            PdfResources resources = xobject.getResources();

            RegexBasedLocationExtractionStrategy regexLocator = new RegexBasedLocationExtractionStrategy("Created by:|Calendar:");
            new PdfCanvasProcessor(regexLocator).processContent(content, resources);
            triggerRectangles.clear();
            triggerRectangles.addAll(regexLocator.getResultantLocations().stream().map(loc -> loc.getRectangle()).collect(Collectors.toSet()));

            PdfCanvas pdfCanvas = new PdfCanvas(new PdfStream(), resources, pdfDocument);
            editor.editContent(content, resources, pdfCanvas);
            xobject.getPdfObject().setData(pdfCanvas.getContentStream().getBytes());
        }
    }
}

(EditPageContent test testRemoveSpecificLinesCalendar)
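
The operand rewriting in write(...) can be tried out without iText. What follows is only a minimal sketch under stated assumptions: OperandBlanker and blank are hypothetical names, plain Java strings and lists stand in for PdfString, PdfLiteral and PdfArray, and each instruction is assumed to end with the operator itself, as the indexing in the code above suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the operand rewriting in write(...) above.
// A String models a PdfString or the operator, a List models a PdfArray.
final class OperandBlanker {

    static final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");

    // Returns a copy of the instruction whose text operand is emptied.
    // Assumption: the last element of the list is the operator itself.
    static List<Object> blank(List<Object> instruction) {
        List<Object> result = new ArrayList<>(instruction);
        Object operator = result.get(result.size() - 1);
        if (!TEXT_SHOWING_OPERATORS.contains(operator)) {
            return result; // not a text-showing instruction, leave it untouched
        }
        if ("TJ".equals(operator)) {
            // [ (Cal) -20 (endar:) ] TJ  becomes  [ ] TJ
            result.set(0, new ArrayList<Object>());
        } else {
            // Tj, ' and " all carry their string directly before the operator
            result.set(result.size() - 2, "");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(blank(Arrays.<Object>asList("Created by: Jon", "Tj")));
        System.out.println(blank(Arrays.<Object>asList(1.0, 2.0, "Calendar: Work", "\"")));
        System.out.println(blank(Arrays.<Object>asList(Arrays.asList("Cal", -20, "endar:"), "TJ")));
    }
}
```

Replacing the operand instead of dropping the whole instruction keeps the side effects of the ' and " operators (moving to the next line, setting word and character spacing), so the text that follows should stay where it was.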


Beware, this is a proof of concept, and it is tailored to the OP's use case: the PdfCanvasEditor here is only used to inspect and edit the first-level form XObjects of each page, because PDFs created from Google Calendar in the Agenda format contain all their page content in a form XObject which in turn is drawn in the page content stream. Furthermore, text is expected to run parallel to the top of the page.
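
To make the position test more tangible, here is a small sketch of the arithmetic behind it. It is not iText code: TextLineCheck, multiply and onTriggerLine are hypothetical names, and the matrices are plain arrays of nine values stored row by row, with the translation in the third row. That layout is an assumption of the sketch.

```java
// Hypothetical helper mirroring the position test in write(...) above, with
// plain double arrays in place of iText's Matrix and Rectangle classes.
final class TextLineCheck {

    // Index of the vertical translation in a row-by-row 3x3 matrix (assumed layout).
    static final int I32 = 7;

    // Multiplies two 3x3 matrices stored row by row in arrays of length 9.
    static double[] multiply(double[] a, double[] b) {
        double[] c = new double[9];
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                for (int k = 0; k < 3; k++) {
                    c[3 * row + col] += a[3 * row + k] * b[3 * k + col];
                }
            }
        }
        return c;
    }

    // True if the text origin, mapped through text matrix and CTM, lies
    // vertically inside the band [bottom, top] of a trigger rectangle.
    static boolean onTriggerLine(double[] textMatrix, double[] ctm, double bottom, double top) {
        double y = multiply(textMatrix, ctm)[I32];
        return bottom <= y && y <= top;
    }

    public static void main(String[] args) {
        double[] textMatrix = {1, 0, 0, 0, 1, 0, 100, 700, 1};   // origin at (100, 700)
        double[] ctm = {0.75, 0, 0, 0, 0.75, 0, 0, 10, 1};       // scale by 0.75, shift up by 10
        System.out.println(multiply(textMatrix, ctm)[I32]);
        System.out.println(onTriggerLine(textMatrix, ctm, 530, 540));
    }
}
```

Only the y value of the text origin is compared here. For rotated text that single value would no longer identify a line, which is presumably why text has to run parallel to the top of the page.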

mkl
  • Thank you for this solution. It's unfortunate that you need to use reflection to access things not normally visible in iText7. – Jon Aug 18 '18 at 14:22
  • Yes, that's an evil one sees in a lot of libraries, fields or methods made private (or same-package-only) even though it would make a lot of sense for them to be at least protected or even public. – mkl Aug 18 '18 at 19:35