I use the method "Annotation.getBox" from the PDF Clown library to get the positions of highlights in text. This way, the positions of bold or italic text are retrieved as well. How can this be avoided? I want to get only the Rectangle2D of real highlights.
-
Please share some pivotal code to allow us to understand what you do. And please link to a sample PDF with which we can reproduce the issue. – mkl Aug 05 '16 at 13:13
-
PageAnnotations annotations = page.getAnnotations(); for (Annotation annotation : annotations) { highlightArea = annotation.getBox();} ...I get the annotations of a page in the PDF and take their positions, but sometimes I also get the position of bold or italic text. Sorry, I can't link a sample PDF because it's a confidential file. – godani Aug 05 '16 at 16:27
-
Then try to find a different, non-confidential file that reproduces the issue. If you cannot find any, chances are the problem is in your PDF itself. – mkl Aug 05 '16 at 19:57
1 Answer
Unfortunately the OP failed to share an example PDF. He also merely provided a very small code fragment. Thus, the following can only speculate...
The code fragment provided by the OP in a comment looks like this:
PageAnnotations annotations = page.getAnnotations();
for (Annotation annotation : annotations)
{
highlightArea = annotation.getBox();
}
Thus, each loop iteration overwrites the variable highlightArea, so after the loop it holds the Box value of the final element of the annotations of the given page.
Probable reasons why highlightArea may cover content other than highlighted text (sometimes bold or italic text in the OP's case):
- That final annotation probably isn't a highlight annotation at all but an annotation of some other type.
- Even if that final annotation is a highlight annotation, not all the content of its box is displayed as highlighted, only the quadrilaterals in the QuadPoints annotation dictionary entry or some custom areas defined by the appearance stream of the annotation.
For the latter case confer section 12.5.6.10 "Text Markup Annotations" in the PDF specification:
QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order
x1 y1 x2 y2 x3 y3 x4 y4
specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x1, y1) and (x2, y2).
The annotation dictionary’s AP entry, if present, shall take precedence over QuadPoints; see Table 168 and 12.5.5, “Appearance Streams.”
Beware, though: Adobe Reader does not order the vertices as specified, and furthermore it does not properly display highlights whose coordinates are in the specified order. Confer the Stack Overflow Q&A "PDF Spec vs Acrobat creation (QuadPoints)", which is old but still applies to current Adobe Acrobat versions.
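If you read the QuadPoints numbers yourself, one way to sidestep the vertex order problem is to reduce each quadrilateral to its axis-aligned bounding rectangle, because minimum and maximum do not depend on the order of the vertices. The following is only a minimal sketch under that assumption; the class name QuadPointsBoxes and the method toBoundingBoxes are made up for illustration and are not part of PDF Clown, and for rotated text the bounding rectangle is larger than the actual highlighted quadrilateral.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDF Clown.
final class QuadPointsBoxes {

    /**
     * Converts a flat QuadPoints array (8 numbers per quadrilateral:
     * x1 y1 x2 y2 x3 y3 x4 y4) into one bounding rectangle per quadrilateral.
     * The result does not depend on the order of the four vertices.
     */
    static List<Rectangle2D> toBoundingBoxes(double[] quadPoints) {
        if (quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException("QuadPoints length must be a multiple of 8");
        }
        List<Rectangle2D> boxes = new ArrayList<>();
        for (int i = 0; i < quadPoints.length; i += 8) {
            double minX = Double.POSITIVE_INFINITY;
            double minY = Double.POSITIVE_INFINITY;
            double maxX = Double.NEGATIVE_INFINITY;
            double maxY = Double.NEGATIVE_INFINITY;
            // Visit the four vertices of this quadrilateral.
            for (int v = 0; v < 4; v++) {
                double x = quadPoints[i + 2 * v];
                double y = quadPoints[i + 2 * v + 1];
                minX = Math.min(minX, x);
                minY = Math.min(minY, y);
                maxX = Math.max(maxX, x);
                maxY = Math.max(maxY, y);
            }
            // In default user space, (minX, minY) is the lower left corner.
            boxes.add(new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // One quadrilateral with made-up coordinates.
        double[] quadPoints = {100, 700, 200, 700, 100, 688, 200, 688};
        for (Rectangle2D box : toBoundingBoxes(quadPoints)) {
            System.out.println(box);
        }
    }
}
```

Keep in mind that an appearance stream, if present, still takes precedence over QuadPoints, so these rectangles only approximate what a viewer actually paints.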
If your annotation is an instance of TextMarkup, you can comfortably retrieve the quadrilaterals using the TextMarkup method getMarkupBoxes.
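To illustrate the idea without a sample PDF, here is a rough sketch of such a loop: it skips annotations that are not text markups and collects the markup boxes of every highlight instead of overwriting a single variable. The types below are simplified stand-ins written for this example, NOT the PDF Clown classes. In particular, the MarkupType enum and the getMarkupType accessor are assumptions about how one would tell highlights apart from underline or strikeout markups, so check the PDF Clown API of your version for the actual names and return types.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class HighlightFilterSketch {

    // Stand-in for an annotation type; only getBox mirrors the name used above.
    interface Annotation {
        Rectangle2D getBox();
    }

    // Assumed markup kinds (the text markup annotation types of the PDF specification).
    enum MarkupType { HIGHLIGHT, UNDERLINE, STRIKE_OUT, SQUIGGLY }

    // Stand-in for a text markup annotation.
    static final class TextMarkup implements Annotation {
        private final MarkupType type;
        private final List<Rectangle2D> markupBoxes;

        TextMarkup(MarkupType type, List<Rectangle2D> markupBoxes) {
            this.type = type;
            this.markupBoxes = markupBoxes;
        }

        MarkupType getMarkupType() { return type; }

        List<Rectangle2D> getMarkupBoxes() { return markupBoxes; }

        // The annotation box is the union of all marked-up areas.
        @Override
        public Rectangle2D getBox() {
            Rectangle2D union = null;
            for (Rectangle2D box : markupBoxes) {
                union = (union == null) ? (Rectangle2D) box.clone() : union.createUnion(box);
            }
            return union;
        }
    }

    // Stand-in for any other annotation type, e.g. a link.
    static final class Link implements Annotation {
        private final Rectangle2D box;

        Link(Rectangle2D box) { this.box = box; }

        @Override
        public Rectangle2D getBox() { return box; }
    }

    /** Collects the marked-up areas of all highlight annotations. */
    static List<Rectangle2D> highlightAreas(List<Annotation> annotations) {
        List<Rectangle2D> areas = new ArrayList<>();
        for (Annotation annotation : annotations) {
            if (!(annotation instanceof TextMarkup)) {
                continue; // not a text markup annotation at all
            }
            TextMarkup markup = (TextMarkup) annotation;
            if (markup.getMarkupType() != MarkupType.HIGHLIGHT) {
                continue; // underline, strikeout, or squiggly
            }
            areas.addAll(markup.getMarkupBoxes());
        }
        return areas;
    }

    public static void main(String[] args) {
        List<Annotation> annotations = new ArrayList<>();
        annotations.add(new Link(new Rectangle2D.Double(0, 0, 50, 10)));
        List<Rectangle2D> marked = new ArrayList<>();
        marked.add(new Rectangle2D.Double(100, 688, 100, 12));
        annotations.add(new TextMarkup(MarkupType.HIGHLIGHT, marked));
        System.out.println(highlightAreas(annotations).size() + " highlighted area(s)");
    }
}
```

Collecting into a list also avoids the problem of the original fragment, where only the box of the last annotation on the page survives the loop.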
Furthermore, you can retrieve the appearance streams using the Annotation method getAppearance. Determining which areas an appearance stream highlights may be non-trivial, though.
-
Thank you for your reply. Sorry, it's not possible for me to reproduce the issue with another file. I think the problem really is in my PDF file. – godani Aug 10 '16 at 10:16