
I have many PDF documents with extremely large tokens plastered across the entire front page (see image). I'm looking for an automated method to remove these.

Apache PDFBox has a pretty extensive API. Is there any way to match these tokens by regex, remove them, and re-save the PDF?

An image from an example PDF is posted below. The tokens I'd like to remove are: [KS/2019:589] LokalvÄrd Grundskolor & Idrottshallar, which are plastered on top of the regular text. Google Drive link to full PDF file.

Mountain_sheep

1 Answer

You can use the PdfContentStreamEditor class from this answer (don't forget to apply the fix mentioned at the bottom of the answer) like this:

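// Imports, assuming PDFBox 2.x package names; PdfContentStreamEditor comes from the referenced answer.
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.Matrix;

try (   PDDocument document = ...   ) {
    PDPage page = document.getPage(0);
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                // The nominal font size alone is not enough: the text matrix and the
                // current transformation matrix can both scale the glyphs.
                float fs = getGraphicsState().getTextState().getFontSize();
                Matrix matrix = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
                // Transform the vector (0, fs) and measure its length to get the effective size.
                Point2D.Float transformedFsVector = matrix.transformPoint(0, fs);
                Point2D.Float transformedOrigin = matrix.transformPoint(0, 0);
                double transformedFs = transformedFsVector.distance(transformedOrigin);
                // Drop the instruction (do not copy it to the output) if the text is very large.
                if (transformedFs > 50)
                    return;
            }

            // Copy every other instruction unchanged.
            super.write(contentStreamWriter, operator, operands);
        }

        // The four PDF operators that actually draw text.
        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    editor.processPage(page);
    document.save(...);
}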

(EditPageContent test testRemoveBigTextKommersAnnonsElite)

You can find some explanations in the referenced answer.
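The size test in the override above can be tried in isolation. The following is only a sketch: it uses java.awt.geom.AffineTransform as a stand-in for the PDFBox Matrix class, and the class name EffectiveFontSize, its method names, and the sample matrices are all hypothetical and chosen for illustration. The threshold of 50 is the one from the code above. It shows why a nominal font size of 1 can still mean very large text once the text matrix scales it.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class EffectiveFontSize {
    // Threshold taken from the answer's code; tune it for your documents.
    static final double BIG_TEXT_THRESHOLD = 50;

    // Length of the vector (0, fontSize) after applying the text matrix
    // first and the current transformation matrix second.
    static double effectiveFontSize(double fontSize, AffineTransform textMatrix, AffineTransform ctm) {
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix); // point -> ctm(textMatrix(point))
        Point2D origin = combined.transform(new Point2D.Double(0, 0), null);
        Point2D top = combined.transform(new Point2D.Double(0, fontSize), null);
        return top.distance(origin);
    }

    static boolean isBigText(double fontSize, AffineTransform textMatrix, AffineTransform ctm) {
        return effectiveFontSize(fontSize, textMatrix, ctm) > BIG_TEXT_THRESHOLD;
    }

    public static void main(String[] args) {
        AffineTransform identity = new AffineTransform();

        // Nominal size 1, but the text matrix scales by 72: large text.
        AffineTransform scaled = AffineTransform.getScaleInstance(72, 72);
        System.out.println(effectiveFontSize(1, scaled, identity) + " big=" + isBigText(1, scaled, identity));

        // Nominal size 12 with no scaling: ordinary body text.
        System.out.println(effectiveFontSize(12, identity, identity) + " big=" + isBigText(12, identity, identity));

        // Rotation does not change the measured length, only scaling does.
        AffineTransform rotated = AffineTransform.getRotateInstance(Math.PI / 4);
        rotated.scale(60, 60);
        System.out.println(effectiveFontSize(1, rotated, identity) + " big=" + isBigText(1, rotated, identity));
    }
}
```

Measuring the transformed vector rather than reading the font size operand directly keeps the check robust against documents that set a font size of 1 and do all scaling in the matrices, and against rotated text such as diagonal stamps.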

mkl