PDFBox 2.0 : First letter of some words are not visible when extracting text

Question

Using PDFBox 2.0.8 and the solution provided here, I'm trying to extract only the text that is actually visible on the page. The basic functionality of PDFTextStripper returns all available text even when it is not visible on the page, which is a big problem. To get rid of the invisible text, we need to copy a lot of code from PageDrawer and take the clip paths into account to decide whether a particular character should be processed. However, in some files (like here) the first letter of some words is always outside the clip box (in the linked file: "Tenant" is missing the "T", "Monthly Rent" the "R", "Pet Rent" the "R") when doing this check:

@Override
protected void processTextPosition(TextPosition text) {
    Matrix textMatrix = text.getTextMatrix();
    Vector start = textMatrix.transform(new Vector(0, 0));
    PDGraphicsState gs = getGraphicsState();
    Area area = gs.getCurrentClippingPath();
    if (area == null || area.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY()))
        super.processTextPosition(text);
}

It is always only the first character of a token that is missed and, which I think could be important, it is always an uppercase letter. So it seems there is some additional transformation involved. Has anybody seen this problem? Where are those transformations made in the original classes? Thanks a lot in advance!

[The solution](https://stackoverflow.com/questions/47908124/pdfbox-removing-invisible-text-by-clip-filling-paths-issue) you point to in turn refers to [another solution](https://stackoverflow.com/a/47396555/1729265) it is based upon. There you can read that the criteria for visibility were in fact reduced to *one can assume a character to be visible iff the start of its baseline is visible*. Obviously this criterion can err sometimes; that is probably the case for your PDF. I'll have a look at it next week. – mkl, Feb 10 '18 at 20:56
I quickly looked into the PDF. Indeed, the start of the baseline of the "T" in "Tenant", for example, has an *x* coordinate of 154.6799, while the clip path there clips everything below an *x* coordinate value of 154.68. So the start of the character baseline is just cut off, but virtually all of the character is visible. For your file, therefore, you need different criteria for visibility. As mentioned above, next week... – mkl, Feb 10 '18 at 21:08
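
To illustrate the numbers in the comment above, here is a minimal sketch that uses only `java.awt.geom`, without PDFBox. Only the two *x* values (154.6799 and 154.68) come from the comment. The clip rectangle's other dimensions, the baseline *y* value and the glyph width are made-up assumptions, and the class and method names are hypothetical.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

class BaselineStartClipDemo {
    // Hypothetical clip area: only the left edge (154.68) is taken from the
    // comment; y, width and height are assumptions for illustration.
    static Area clip() {
        return new Area(new Rectangle2D.Double(154.68, 100.0, 200.0, 20.0));
    }

    // The criterion from the linked solution: is the start of the baseline inside the clip?
    static boolean startVisible(Area clip, double startX, double baselineY) {
        return clip.contains(startX, baselineY);
    }

    // Alternative criterion: is the middle of the baseline inside the clip?
    static boolean middleVisible(Area clip, double startX, double width, double baselineY) {
        return clip.contains(startX + width / 2.0, baselineY);
    }

    public static void main(String[] args) {
        Area clip = clip();
        double startX = 154.6799; // baseline start of the "T", from the comment
        double baselineY = 105.0; // assumed
        double glyphWidth = 6.0;  // assumed
        System.out.println("start visible:  " + startVisible(clip, startX, baselineY));
        System.out.println("middle visible: " + middleVisible(clip, startX, glyphWidth, baselineY));
    }
}
```

With these assumed values the start point lies 0.0001 units to the left of the clip edge and is rejected, while the middle point is accepted, which matches the symptom of a dropped first letter.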

D.F. Stones · Answer 1 · 2018-02-16T14:04:53.473


One possible solution here could be to check whether the middle (not the start) of the character lies inside the clip box, something like:

@Override
protected void processTextPosition(TextPosition text) {
    Matrix textMatrix = text.getTextMatrix();
    // Start of the character's baseline
    Vector start = textMatrix.transform(new Vector(0, 0));
    // Middle of the baseline: shift the start by half the character width along x
    Vector middle = new Vector(start.getX() + text.getWidth() / 2, start.getY());
    PDGraphicsState gs = getGraphicsState();
    Area area = gs.getCurrentClippingPath();
    // lowerLeftX / lowerLeftY are fields of the enclosing class, as in the question's code
    if (area == null || area.contains(lowerLeftX + middle.getX(), lowerLeftY + middle.getY()))
        super.processTextPosition(text);
}

edited Feb 16 '18 at 14:04

answered Feb 12 '18 at 12:34


This basically was my idea, too. Your implementation as is, though, quite probably won't work as desired on rotated pages. I have not yet had the time to try and implement a test taking all this into account. – mkl Feb 16 '18 at 16:54
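
The rotation concern in the comment above can be illustrated with plain `java.awt.geom.AffineTransform`, independent of PDFBox. This is only a geometric sketch: it assumes the glyph advance is known in the untransformed coordinate system of the matrix, which is not necessarily what `TextPosition.getWidth()` returns, and the class and method names are hypothetical. Adding half the width to the *x* coordinate after transforming moves the point sideways even when the baseline runs vertically, whereas transforming the midpoint itself follows the baseline.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class RotatedMidpointDemo {
    // Naive approach: transform the origin, then add half the advance along x.
    static Point2D naiveMidpoint(AffineTransform matrix, double advance) {
        Point2D start = matrix.transform(new Point2D.Double(0, 0), null);
        return new Point2D.Double(start.getX() + advance / 2.0, start.getY());
    }

    // Rotation-aware approach: transform the midpoint of the untransformed baseline.
    static Point2D transformedMidpoint(AffineTransform matrix, double advance) {
        return matrix.transform(new Point2D.Double(advance / 2.0, 0), null);
    }

    public static void main(String[] args) {
        // Assumed example matrix: baseline rotated by 90 degrees, origin at (100, 200).
        AffineTransform matrix = AffineTransform.getTranslateInstance(100, 200);
        matrix.quadrantRotate(1);
        double advance = 8.0; // assumed glyph advance in untransformed units

        System.out.println("naive:       " + naiveMidpoint(matrix, advance));
        System.out.println("transformed: " + transformedMidpoint(matrix, advance));
    }
}
```

For this assumed matrix the naive midpoint is (104, 200), which lies off the vertical baseline, while the transformed midpoint is (100, 204), which lies on it.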
