The code in the question "Not able to read the exact text highlighted across the lines" already illustrates most of the concepts needed to extract text from limited content regions on a page with PDFBox.
Having studied this code, the OP still wondered in a comment:
But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts?
In general the area an annotation refers to is a rectangle:
Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.
(from Table 164 – Entries common to all annotation dictionaries - in ISO 32000-1)
For some annotation types (e.g. text markups), this location value does not suffice because:
- the text to mark up may be written at some odd angle, but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
- the text to mark up may start anywhere in one line and end anywhere in another, so the markup area is not rectangular at all but the union of multiple rectangular parts.
To cope with such annotation types, therefore, the PDF specification provides a more generic way to define areas:
QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order
x1 y1 x2 y2 x3 y3 x4 y4
specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x1, y1) and (x2, y2).
(from Table 179 – Additional entries specific to text markup annotations - in ISO 32000-1)
Thus, instead of the rectangle given by
PDRectangle rect = pdfAnnot.getRectangle();
in the code in the referenced question, you have to consider the quadrilaterals given by
COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
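and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately PDFTextStripperByArea.addRegion expects a rectangle as parameter, not some generic quadrilateral. As text usually is printed horizontally or vertically, though, that should not pose too big a problem: in that case each quadrilateral is itself (at least approximately) a rectangle with edges parallel to the page edges, so you can add one region per quadrilateral.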
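To make the conversion from quadrilaterals to rectangles more concrete, here is a minimal sketch in plain Java. It assumes the QuadPoints values have already been copied into a float[] (8 numbers per quadrilateral) and simply computes the bounding box of each quadrilateral by taking the minimum and maximum coordinates, so it does not depend on the order of the four vertices. The class and method names (QuadPointsRegions, boundingBoxes, flipY) are made up for this illustration and are not part of PDFBox. The flipY helper rests on the assumption that the stripper regions are measured from the top of the page while QuadPoints use PDF user space with the origin at the bottom; check this against the coordinate handling in the referenced question before relying on it.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class QuadPointsRegions {

    /**
     * Returns one axis-parallel bounding box per quadrilateral.
     * The input holds 8 numbers (x1 y1 x2 y2 x3 y3 x4 y4) per quadrilateral,
     * in PDF user space; the vertex order does not matter here.
     */
    public static List<Rectangle2D.Float> boundingBoxes(float[] quadPoints) {
        if (quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException("QuadPoints length must be a multiple of 8");
        }
        List<Rectangle2D.Float> boxes = new ArrayList<>();
        for (int i = 0; i < quadPoints.length; i += 8) {
            float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Walk over the four (x, y) pairs of this quadrilateral.
            for (int j = 0; j < 8; j += 2) {
                float x = quadPoints[i + j];
                float y = quadPoints[i + j + 1];
                minX = Math.min(minX, x);
                maxX = Math.max(maxX, x);
                minY = Math.min(minY, y);
                maxY = Math.max(maxY, y);
            }
            boxes.add(new Rectangle2D.Float(minX, minY, maxX - minX, maxY - minY));
        }
        return boxes;
    }

    /**
     * Assumption: converts a box from bottom-up PDF user space to a
     * top-down coordinate system for a page of the given height.
     */
    public static Rectangle2D.Float flipY(Rectangle2D.Float box, float pageHeight) {
        return new Rectangle2D.Float(
                box.x, pageHeight - (box.y + box.height), box.width, box.height);
    }

    public static void main(String[] args) {
        // Two made-up quadrilaterals, e.g. a highlight spanning two lines.
        float[] quads = {
            100, 700, 200, 700, 100, 680, 200, 680,
             50, 680, 150, 680,  50, 660, 150, 660
        };
        for (Rectangle2D.Float box : boundingBoxes(quads)) {
            System.out.println(flipY(box, 792f));
        }
    }
}
```

With PDFBox, the float[] can presumably be obtained from the array above via quadsArray.toFloatArray(), and each resulting rectangle passed to the stripper with a distinct region name, e.g. stripper.addRegion("quad" + i, rect). For text written at an odd angle the bounding box will be larger than the quadrilateral and may pick up neighbouring text.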
PS: One warning concerning the specification of the QuadPoints: the order of the vertices may differ in real-life PDFs, cf. the question "PDF Spec vs Acrobat creation (QuadPoints)".