
I'm using PDFBox to extract text from a document by extending PDFTextStripper. I've noticed that some of these documents contain invisible characters that are being extracted. I'd like to filter out these invisible characters.

I see that there are already some Stack Overflow posts on this, for example:

I tried subclassing the PDFVisibleTextStripper class found here:

However, I found that this filtered out text that was in fact visible. I used it as a drop-in replacement for PDFTextStripper.

package com.example.foo;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;

public class ExtractChars extends PDFVisibleTextStripper {
  Processor processor;

  public static void extract(PDDocument document, Processor processor) throws IOException {
    ExtractChars instance = new ExtractChars();

    instance.processor = processor;
    instance.setSortByPosition(true);
    instance.setStartPage(1); // PDFTextStripper page numbers are 1-based
    instance.setEndPage(document.getNumberOfPages());

    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    Writer streamWriter = new OutputStreamWriter(stream);

    instance.writeText(document, streamWriter);
  }

  ExtractChars() throws IOException {}

  @Override
  protected void writeString(String _string, List<TextPosition> textPositions) throws IOException {
    for (TextPosition text: textPositions) {
      float height = text.getHeightDir();
      String character = text.getUnicode();

      int pageIndex = getCurrentPageNo() - 1;
      float left = text.getXDirAdj();
      float right = left + text.getWidthDirAdj();
      float bottom = text.getYDirAdj();
      float top = bottom - height;

      BoundingBox box = new BoundingBox(pageIndex, left, right, top, bottom);

      this.processor.process(character, box);
    }
  }

  public interface Processor {
    void process(String character, BoundingBox box);
  }
}

I don't know if there's anything I need to change in my subclass to make this work correctly. I can provide a PDF that exhibits this behaviour if that would be helpful, although it contains sensitive content so I'd need to remove that first.

Instead, I have created a minimal example (below) that exhibits the 'invisible text' behaviour that I am seeing. The bulleted list contains an item at the end '24. a.' that can be highlighted in a PDF viewer such as macOS Preview and copy-pasted out.

This 'a.' is currently being extracted by PDFTextStripper and I'd like it not to be. I don't really understand why this is happening. My guess would be it's to do with clipping but I'd be really grateful if someone could explain what's going on.

My end goal is to filter these characters out so if you have suggestions for how I could handle this specific case in the simplest possible way, that would be appreciated. I don't think I need all of the general methods in PDFVisibleTextStripper.

Many thanks!

%PDF-1.3

1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
>>
endobj

2 0 obj
<<
  /Type /Pages
  /Kids [3 0 R]
  /Count 1
  /MediaBox [0 0 612 792]
>>
endobj

3 0 obj
<<
  /Type /Page
  /Parent 2 0 R
  /Resources 4 0 R
  /Contents 6 0 R
  /MediaBox [0 0 612 792]
>>
endobj

4 0 obj
<<
  /Font <<
    /TT2 5 0 R
  >>
>>
endobj

5 0 obj
<<
  /BaseFont
  /OXRDVC+Helvetica
  /Subtype /TrueType
  /Type /Font
>>
endobj

6 0 obj
<<
>>
stream
q 0 54 612 648 re W n /Cs1 cs 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm Q
q 48 93.30545 516 569.4218 re W n /Cs1 cs 1 1 1 sc 48 93.30545 516 569.4218 re f 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 66.86 589.28 Tm /TT2 1 Tf (24.  ) Tj ET Q
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 96.86 40.39 Tm /TT2 1 Tf (a.  ) Tj ET Q 
endstream
endobj

trailer
<<
  /Root 1 0 R
>>

%%EOF
Chris
  • I'll look into this sometime in the next few days. But please **(A)** repair your PDF - it is incomplete; at the very least (which is obvious) the cross-reference table or stream is missing - and **(B)** share it as a binary, since copying and pasting raw PDF data is very likely to damage the file. – mkl Dec 06 '20 at 16:28
  • Ahh, thanks. I've been too heavy-handed with deleting content to get it down to a minimal example. I'll see if I can repair it so that it validates again. – Chris Dec 06 '20 at 18:36

1 Answer


I figured out what's going on. The PDF contains a clipping rectangle that does not include 'a.'. I tried using PDFVisibleTextStripper but that stripped out text elsewhere in other documents that was in fact visible.

In the end, I wrote a class that inherits from PageDrawer and overrides the showGlyph method to access the characters being drawn on the page. This method skips a character if its bounding box lies entirely outside getGraphicsState().getCurrentClippingPath().getBounds2D().
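To see why 'a.' is clipped in the sample PDF from the question, the arithmetic can be reproduced with plain java.awt.geom classes. This is only a sketch: ClipCheck and its methods are hypothetical names, the numbers are copied from the content stream in the question, and it tests only the baseline origin of each string rather than the full glyph box that the code below uses.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Illustrative only: ClipCheck and its methods are hypothetical names, not PDFBox API.
class ClipCheck {
  // The two "x y w h re W n" clipping rectangles from the sample content stream.
  static final Rectangle2D OUTER = new Rectangle2D.Double(0, 54, 612, 648);
  static final Rectangle2D INNER = new Rectangle2D.Double(48, 93.30545, 516, 569.4218);

  // Maps the Tm translation (the baseline origin of the text) into page space
  // by applying the transform set with "1 0 0 0.8181818 0 54 cm".
  static Point2D baseline(double tx, double ty) {
    AffineTransform cm = new AffineTransform(1.0, 0.0, 0.0, 0.8181818, 0.0, 54.0);
    return cm.transform(new Point2D.Double(tx, ty), null);
  }

  // Nested clips intersect, so the effective clip is the intersection of both.
  static boolean insideClip(Point2D point) {
    return OUTER.createIntersection(INNER).contains(point);
  }

  public static void main(String[] args) {
    System.out.println(insideClip(baseline(66.86, 589.28))); // "24." -> true
    System.out.println(insideClip(baseline(96.86, 40.39)));  // "a."  -> false
  }
}
```

The baseline of 'a.' ends up at roughly y = 87.05, just below the bottom edge of the inner clipping rectangle at y = 93.30545, which is why a viewer does not paint it even though the text is still present in the content stream.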

This unfortunately means I'm not using PDFTextStripper anymore so I had to reimplement bits of its behaviour such as sorting characters by position (I was using setSortByPosition(true)). It was also a bit tricky to calculate the correct bounding box of the character based on font size and displacement.

ExtractChars.java

package com.example.foo;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.font.*;
import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import org.apache.pdfbox.util.Vector;
import java.awt.geom.*;
import java.io.*;

// This class effectively renders the PDF document in order to extract its
// text. It intercepts the showGlyph function provided by PageDrawer. We used to
// use PDFTextStripper but that has no way to exclude clipped characters.

public class ExtractChars extends PageDrawerHelper {
  // Skip erroneous characters smaller than this height. This might never happen
  // but there are places in the code that divide by height, so guard against it.
  static final float MIN_CHARACTER_HEIGHT = 0.01f;

  Processor processor;

  ExtractChars(PageDrawerParameters params, float pageHeight, int pageIndex, Processor processor) throws IOException {
    super(params, pageHeight, pageIndex);
    this.processor = processor;
  }

  // We can't move this method up to the superclass because the Renderer is
  // different each time. It needs to build an instance of the current class.
  public static void extract(PDDocument document, Processor processor) throws IOException {
    Renderer renderer = new Renderer(document);
    renderer.processor = processor;

    for (int i = 0; i < document.getNumberOfPages(); i += 1) {
      PDPage page = document.getPage(i);

      renderer.pageHeight = page.getMediaBox().getHeight();
      renderer.pageIndex = i;
      renderer.renderImage(i);
    }
  }

  @Override
  public void showGlyph(Matrix matrix, PDFont font, int _code, String unicode, Vector displacement) throws IOException {
    if (unicode == null) { return; }

    // Get the width and height of the character relative to font size.
    // The height does not change but the width does, e.g. 'M' is wider than 'I'.
    float width = displacement.getX();
    float height = fontHeight(font) / 2;

    BoundingBox charBox = clippedBoundingBox(matrix, width, height);

    // Skip the character if it is outside the clipping region and not visible.
    if (charBox == null) { return; }

    float boxHeight = charBox.bottom - charBox.top;
    if (boxHeight < MIN_CHARACTER_HEIGHT) { return; }

    // We need the text direction so we can sort text in separate buckets based on this.
    int direction = textDirection(matrix);

    processor.process(unicode, charBox, direction);
  }

  // https://stackoverflow.com/questions/17171815/get-the-font-height-of-a-character-in-pdfbox#answer-17202929
  float fontHeight(PDFont font) {
    return font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000;
  }

  int textDirection(Matrix matrix) {
    float a = matrix.getValue(0, 0);
    float b = matrix.getValue(0, 1);
    float c = matrix.getValue(1, 0);
    float d = matrix.getValue(1, 1);

    // This logic is copied from:
    // https://github.com/atsuoishimoto/pdfbox-ja/blob/master/src/main/java/org/apache/pdfbox/util/TextPosition.java
    if ((a > 0) && (Math.abs(b) < d) && (Math.abs(c) < a) && (d > 0)) {
      return 0;
    } else if ((a < 0) && (Math.abs(b) < Math.abs(d)) && (Math.abs(c) < Math.abs(a)) && (d < 0)) {
      return 180;
    } else if ((Math.abs(a) < Math.abs(c)) && (b > 0) && (c < 0) && (Math.abs(d) < b)) {
      return 90;
    } else if ((Math.abs(a) < c) && (b < 0) && (c > 0) && (Math.abs(d) < Math.abs(b))) {
      return 270;
    }

    return 0;
  }

  // We can't construct an instance of ExtractChars directly because its
  // constructor requires PageDrawerParameters which is private to the package.
  // Instead, make an instance via a renderer and forward the fields to it.
  static class Renderer extends PDFRenderer {
    Processor processor;
    float pageHeight;
    int pageIndex;

    Renderer(PDDocument document) {
      super(document);
    }

    protected PageDrawer createPageDrawer(PageDrawerParameters params) throws IOException {
      return new ExtractChars(params, pageHeight, pageIndex, processor);
    }
  }

  public interface Processor {
    void process(String character, BoundingBox box, int direction);
  }
}

PageDrawerHelper.java

package com.example.foo;

import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import java.awt.geom.*;
import java.io.*;

// This class provides utility methods to subclasses, mostly so they can check
// if the current content is being clipped and therefore should be skipped.
//
// We shouldn't really use inheritance for sharing code but this has the
// advantage of being able to call some methods of the PageDrawer superclass.

public class PageDrawerHelper extends PageDrawer {
  float pageHeight;
  int pageIndex;

  PageDrawerHelper(PageDrawerParameters params, float pageHeight, int pageIndex) throws IOException {
    super(params);

    this.pageHeight = pageHeight;
    this.pageIndex = pageIndex;
  }

  // Gets the bounding box for a matrix by transforming corner points and taking the
  // min/max values in the x- and y-directions. This ensures rotation and skew
  // are taken into account. This method can return null if content is clipped.
  BoundingBox clippedBoundingBox(Matrix matrix, float width, float height) {
    Point2D p0 = matrix.transformPoint(0, 0);
    Point2D p1 = matrix.transformPoint(0, height);
    Point2D p2 = matrix.transformPoint(width, 0);
    Point2D p3 = matrix.transformPoint(width, height);

    BoundingBox contentBox = boundingBox(p0, p1, p2, p3);
    BoundingBox clippedBox = applyClipping(contentBox);

    return clippedBox;
  }

  BoundingBox boundingBox(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
    Point2D topLeft = topLeft(p0, p1, p2, p3);
    Point2D botRight = botRight(p0, p1, p2, p3);

    float left = (float)topLeft.getX();
    float right = (float)botRight.getX();
    float top = pageHeight - (float)botRight.getY();
    float bottom = pageHeight - (float)topLeft.getY();

    return new BoundingBox(pageIndex, left, right, top, bottom);
  }

  Point2D topLeft(Point2D... points) {
    double minX = points[0].getX();
    double minY = points[0].getY();

    for (int i = 1; i < points.length; i += 1) {
      minX = Math.min(minX, points[i].getX());
      minY = Math.min(minY, points[i].getY());
    }

    return new Point2D.Double(minX, minY);
  }

  Point2D botRight(Point2D... points) {
    double maxX = points[0].getX();
    double maxY = points[0].getY();

    for (int i = 1; i < points.length; i += 1) {
      maxX = Math.max(maxX, points[i].getX());
      maxY = Math.max(maxY, points[i].getY());
    }

    return new Point2D.Double(maxX, maxY);
  }

  BoundingBox applyClipping(BoundingBox box) {
    Rectangle2D clip = getGraphicsState().getCurrentClippingPath().getBounds2D();

    float clipLeft = (float)clip.getMinX();
    float clipRight = (float)clip.getMaxX();
    float clipTop = pageHeight - (float)clip.getMaxY();
    float clipBottom = pageHeight - (float)clip.getMinY();

    float left = Math.max(box.left, clipLeft);
    float right = Math.min(box.right, clipRight);
    float top = Math.max(box.top, clipTop);
    float bottom = Math.min(box.bottom, clipBottom);

    if (left >= right || top >= bottom) {
      return null;
    } else {
      return new BoundingBox(pageIndex, left, right, top, bottom);
    }
  }
}

CharacterSorter.java

package com.example.foo;

import java.util.*;

public class CharacterSorter {
  ArrayList<String> characters;
  ArrayList<BoundingBox> boxes;
  ArrayList<Integer> directions;

  public CharacterSorter(ArrayList<String> characters, ArrayList<BoundingBox> boxes, ArrayList<Integer> directions) {
    this.characters = characters;
    this.boxes = boxes;
    this.directions = directions;
  }

  public void sortByDirectionThenPosition() {
    ArrayList<Tuple> tuples = new ArrayList<>();

    for (int i = 0; i < characters.size(); i += 1) {
      tuples.add(new Tuple(characters.get(i), boxes.get(i), directions.get(i)));
    }

    Collections.sort((List)tuples);
    characters.clear(); boxes.clear(); directions.clear();

    for (Tuple tuple: tuples) {
      characters.add(tuple.character);
      boxes.add(tuple.box);
      directions.add(tuple.direction);
    }
  }

  // This helper class wraps the three fields associated with a single character
  // and provides a comparator function which mimics how PDFTextStripper orders
  // its characters when #setSortByPosition(true) is set.
  class Tuple implements Comparable<Tuple> {
    String character;
    BoundingBox box;
    Integer direction;

    Tuple(String character, BoundingBox box, Integer direction) {
      this.character = character;
      this.box = box;
      this.direction = direction;
    }

    @Override
    public int compareTo(Tuple other) {

      int primary = ((Integer)box.pageIndex).compareTo(other.box.pageIndex);
      if (primary != 0) { return primary; }

      // The remainder of this logic is copied and adapted from:
      // https://github.com/apache/pdfbox/blob/a78f4a2ea058181e5ed05d6367ba7556948331b8/pdfbox/src/main/java/org/apache/pdfbox/text/TextPositionComparator.java#L29-L70

      // Only compare text that is in the same direction.
      int secondary = Float.compare(direction, other.direction);
      if (secondary != 0) { return secondary; }

      // Get the text direction adjusted coordinates.
      float x1 = box.left;
      float x2 = other.box.left;

      float pos1YBottom = box.bottom;
      float pos2YBottom = other.box.bottom;

      // Note that the coordinates have been adjusted so (0, 0) is in upper left.
      float pos1YTop = pos1YBottom - (box.bottom - box.top);
      float pos2YTop = pos2YBottom - (other.box.bottom - other.box.top);

      float yDifference = Math.abs(pos1YBottom - pos2YBottom);

      // We will do a simple tolerance comparison.
      if (yDifference < .1 ||
          pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
          pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
      {
          return Float.compare(x1, x2);
      } else if (pos1YBottom < pos2YBottom) {
          return -1;
      } else {
          return 1;
      }
    }
  }
}
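
The BoundingBox class used throughout is not shown in this answer. A minimal sketch follows, assuming it is a plain value holder with the constructor arguments and fields the code above relies on (pageIndex, left, right, top, bottom). The width() and height() helpers are additions for illustration and are not used by the code above.

```java
// Hypothetical BoundingBox.java: reconstructed from how the class is used above,
// not the original implementation. In the project it would sit in the same
// package as the other classes (com.example.foo).
final class BoundingBox {
  public final int pageIndex;

  // Coordinates use a top-left origin, so top < bottom for a non-empty box.
  public final float left;
  public final float right;
  public final float top;
  public final float bottom;

  BoundingBox(int pageIndex, float left, float right, float top, float bottom) {
    this.pageIndex = pageIndex;
    this.left = left;
    this.right = right;
    this.top = top;
    this.bottom = bottom;
  }

  // Convenience helpers (illustrative additions).
  float width() { return right - left; }
  float height() { return bottom - top; }
}
```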
Chris
  • Chris, thanks. I tried using your code, but could not work out how to initialise the ExtractChars.processor object. Could you please explain? – Orit Nov 24 '21 at 13:44
  • ExtractChars.extract(document, (character, box, direction) -> { // your code here }); – Chris Nov 24 '21 at 15:41
  • The method signature you have provided in your solution is different: `public static void extract(PDDocument document, Processor processor)` – Orit Nov 24 '21 at 17:14
  • It's not. The provided argument in my previous comment is a lambda that is a subtype of the Processor interface defined at the bottom of ExtractChars.java. – Chris Nov 24 '21 at 20:50
  • I see, thanks. How did you implement the processor? Can you please share its code? – Orit Nov 25 '21 at 10:22
  • I have run that code with an empty implementation of the processor, for page 14 (zero-based counting) of the following PDF: https://s25.q4cdn.com/680186029/files/doc_financials/ar-interactive/2018-interactive/ar/images/Xcel_Energy-AR2018.pdf The text `annual report 2018` is covered by an image at its top right corner. The solution you have suggested does extract that hidden text. Do you know how I can avoid extracting text that is covered by an image? – Orit Nov 25 '21 at 10:55
  • The Processor is implemented by the lambda. There's no need to write a separate class or anything - the lambda has the right method signature so Java accepts it as an implementor of the Processor interface. Unfortunately, I don't think I can help with your image question. You could try to extract images as well and see if their bounding boxes overlap the bounding boxes of text and infer that the text might be obscured by the images, but I'm not an expert - you'll have to try your own methods. Good luck. – Chris Nov 28 '21 at 23:31
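
The point about the lambda can be illustrated without PDFBox. The sketch below uses stand-ins throughout: ProcessorDemo, Box and the fake extract method are hypothetical and only mimic the shape of ExtractChars.extract and its Processor interface, so that the example compiles on its own.

```java
// Illustrative only: every name here is a stand-in, not the code from the answer.
class ProcessorDemo {
  // Stand-in for BoundingBox; only the shape of the callback matters here.
  static class Box {
    final float left;
    Box(float left) { this.left = left; }
  }

  // Same single-method shape as ExtractChars.Processor, so a lambda can implement it.
  interface Processor {
    void process(String character, Box box, int direction);
  }

  // Stand-in for ExtractChars.extract: feeds three fixed characters to the callback.
  static void extract(Processor processor) {
    processor.process("2", new Box(66.86f), 0);
    processor.process("4", new Box(71.30f), 0);
    processor.process(".", new Box(75.74f), 0);
  }

  // The lambda is the Processor implementation; no separate class is needed.
  static String collect() {
    StringBuilder text = new StringBuilder();
    extract((character, box, direction) -> text.append(character));
    return text.toString();
  }

  public static void main(String[] args) {
    System.out.println(collect()); // prints "24."
  }
}
```

With the real classes, the same lambda shape would be passed as the second argument of ExtractChars.extract(document, ...), for example to append each character and box to lists that are later handed to CharacterSorter.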