Using PDFBox 2.0.8 and the solution provided here, I'm trying to extract only the text that is actually visible on the page. By default, PDFTextStripper returns all available text, even text that is not visible on the page, which is a real problem. To get rid of it, we need to copy a lot of code from PageDrawer and take the clipping paths into account to decide whether a particular character should be drawn. However, in some files (like here) the first letter of some words always falls outside the clip area (in the linked file, see "Tenant": "T" is missing; "Monthly Rent": "R" is missing; "Pet Rent": "R" is missing) when doing this check:
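    @Override
    protected void processTextPosition(TextPosition text) {
        // Origin of the glyph (start of its baseline) after applying the text matrix.
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));

        // Clipping path in effect when this glyph is shown.
        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();

        // lowerLeftX and lowerLeftY are fields of the enclosing class (not shown here).
        // Keep the glyph only if there is no clip or its origin lies inside the clip.
        if (area == null || area.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY()))
            super.processTextPosition(text);
    }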
Only the first character of a token is ever missed, and, which I think could be important, the missing characters are always uppercase letters. So it seems there is some additional transformation involved. Has anybody seen this problem? Where are those transformations applied in the original classes? Thanks a lot in advance!
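To make the check itself easier to discuss, here is a minimal sketch of it in isolation, using only java.awt.geom and no PDFBox classes. It assumes (I have not verified this against the linked file) that the glyph origin might sit marginally outside the clip rectangle. The class name, the helper names originInside and originNear, and all coordinates are made up for illustration.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

public class ClipCheckSketch {

    /** Strict check: true only if the single point lies inside the clip area (or there is no clip). */
    static boolean originInside(Area clip, double x, double y) {
        return clip == null || clip.contains(x, y);
    }

    /**
     * Tolerant check (hypothetical variant): true if a small square with
     * half-size tol around the point overlaps the clip area.
     */
    static boolean originNear(Area clip, double x, double y, double tol) {
        return clip == null || clip.intersects(x - tol, y - tol, 2 * tol, 2 * tol);
    }

    public static void main(String[] args) {
        // Made-up clip rectangle: x from 100 to 300, y from 700 to 712.
        Area clip = new Area(new Rectangle2D.Double(100, 700, 200, 12));

        // Made-up glyph origin, half a unit to the left of the clip's left edge.
        double x = 99.5;
        double y = 703;

        System.out.println(originInside(clip, x, y));    // false
        System.out.println(originNear(clip, x, y, 1.0)); // true
    }
}
```

If the real coordinates behave like this, a point test on the origin alone would drop the glyph even though almost all of it is inside the clip, which would match the symptom of losing only the first letter of a word.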