I'm trying to extract text from this file. I'm using the solution provided by @mkl here, with one change to the processTextPosition method: as the criterion, I pass the X of the character's center (not its start). This avoids dropping a character just because a couple of its points are clipped:
    @Override
    protected void processTextPosition(TextPosition text) {
        // Origin of the glyph, transformed by the text matrix
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        // Changed from the original: test the horizontal center instead of start and end
        Vector middle = new Vector(start.getX() + text.getWidth() / 2, start.getY());

        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        // lowerLeftX / lowerLeftY are the page offset fields from @mkl's class
        if (area == null || area.contains(lowerLeftX + middle.getX(), lowerLeftY + middle.getY()))
            super.processTextPosition(text);
    }
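To make the criterion itself concrete outside of PDFBox, here is a minimal standalone sketch using only java.awt.geom. The class name ClipCenterCheck, the helper names, and the numbers are made up for illustration and are not taken from the document. It also assumes that the start point, the width, and the clip area are all expressed in the same coordinate space, which may not hold in the real stripper.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, for illustration only: compares the "start point"
// test with the "center point" test against a clipping area.
final class ClipCenterCheck {

    // True if the glyph's start point lies inside the clip (null clip = no clipping).
    static boolean startInsideClip(Area clip, double startX, double baselineY) {
        return clip == null || clip.contains(startX, baselineY);
    }

    // True if the glyph's horizontal center lies inside the clip.
    // Assumes width is in the same units and space as startX and the clip.
    static boolean centerInsideClip(Area clip, double startX, double baselineY, double width) {
        return clip == null || clip.contains(startX + width / 2, baselineY);
    }

    public static void main(String[] args) {
        // Made-up clip rectangle: x from 100 to 300, y from 100 to 150
        Area clip = new Area(new Rectangle2D.Double(100, 100, 200, 50));

        // Glyph starting 2 units left of the clip, 10 units wide
        System.out.println("start inside:  " + startInsideClip(clip, 98, 120));
        System.out.println("center inside: " + centerInsideClip(clip, 98, 120, 10));
    }
}
```

With these made-up values the start point falls just outside the clip while the center falls inside, which is the situation the center criterion is meant to handle.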
However, in the attached document, many characters are still cut by this condition (starting with the very first "Rent Roll" token). Is there any additional transformation I should take into account? Thanks in advance.