
I'm trying to extract text from this file. I'm using the solution provided by @mkl here, with one change to the processTextPosition method: as the criterion, I pass the X coordinate of the character's center (not its start), to avoid the situation where a character is dropped because a few of its points are clipped:

@Override
protected void processTextPosition(TextPosition text) {
    Matrix textMatrix = text.getTextMatrix();
    // Glyph origin after applying the text matrix
    Vector start = textMatrix.transform(new Vector(0, 0));
    // Half the glyph width is added along X only, regardless of the baseline direction
    Vector middle = new Vector(start.getX() + text.getWidth()/2, start.getY());
    PDGraphicsState gs = getGraphicsState();
    Area area = gs.getCurrentClippingPath();
    // Keep the glyph only if its (assumed) center lies inside the current clipping path
    if (area == null || area.contains(lowerLeftX + middle.getX(), lowerLeftY + middle.getY()))
        super.processTextPosition(text);
}

However, in the attached document, many characters are still cut because of this condition (starting with the very first "Rent Roll" token). Is there any additional transformation I should take into account? Thanks in advance.

  • Your "text777.pdf" uses page rotation (by 90°). Thus, your addition `+ text.getWidth()/2` goes in the wrong direction. Essentially you have to take into account in which direction the baseline goes after transformation with the `textMatrix = text.getTextMatrix()`. When answering [this question](https://stackoverflow.com/q/47908124/1729265) which also deals with a rotated page, I got lazy and only checked the glyph origin. – mkl Mar 08 '18 at 15:19
  • Thanks. In the page rotation case, does text.getWidth() actually mean the height? (As I understand it, I should add text.getWidth()/2 or text.getHeight()/2 to X or Y, depending on that rotation.) And do you know how to get the page rotation in processTextPosition? – D.F. Stones Mar 08 '18 at 16:18
  • Depending on the exact rotation of the text you may have to subtract instead of add. It should be possible to determine all this using the text matrix. – mkl Mar 09 '18 at 05:12
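
The idea in the comments above (derive the baseline direction from the text matrix, then step half a glyph width along it) can be sketched with plain `java.awt.geom` classes. This is only an illustration under assumptions: `GlyphCenter` and `center` are hypothetical names, the matrix is passed in as an `AffineTransform`, and `width` is assumed to be the glyph advance measured in the same coordinate system the matrix maps into. How to obtain that transform and width from PDFBox's `TextPosition` is not shown here and would need checking against the PDFBox version in use.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox.
class GlyphCenter {

    /**
     * Returns the point half a glyph width along the baseline from the glyph origin.
     * The baseline direction is taken from the matrix itself, so the same code
     * handles unrotated text and text rotated by 90, 180 or 270 degrees.
     */
    static Point2D center(AffineTransform textMatrix, double width) {
        // Glyph origin and a point one unit along the baseline, both transformed
        Point2D origin = textMatrix.transform(new Point2D.Double(0, 0), null);
        Point2D unit = textMatrix.transform(new Point2D.Double(1, 0), null);

        // Direction of the baseline after transformation
        double dx = unit.getX() - origin.getX();
        double dy = unit.getY() - origin.getY();
        double len = Math.hypot(dx, dy);
        if (len == 0) {
            return origin; // degenerate matrix: fall back to the glyph origin
        }

        // Step half the width along the normalized baseline direction
        return new Point2D.Double(
                origin.getX() + dx / len * width / 2,
                origin.getY() + dy / len * width / 2);
    }

    public static void main(String[] args) {
        // Unrotated text at (100, 200): the center moves along X
        Point2D a = center(new AffineTransform(12, 0, 0, 12, 100, 200), 10);
        // Text rotated by 90 degrees at (100, 200): the center moves along Y
        Point2D b = center(new AffineTransform(0, 12, -12, 0, 100, 200), 10);
        System.out.println(a.getX() + ", " + a.getY());
        System.out.println(b.getX() + ", " + b.getY());
    }
}
```

With a center computed this way, the clipping test from the question would use that point's X and Y instead of `start.getX() + text.getWidth()/2` and `start.getY()`. For a 180° or 270° baseline the step comes out negative on the relevant axis, which matches the remark that one may have to subtract instead of add.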
