Java: Apache PDFbox Extract highlighted text

Question

I am using Apache PDFbox library to extract the the highlighted text (i.e., with yellow background) from a PDF file. I am totally new to this library and don't know which class from it to be used for this purpose. So far I have done extraction of text from comments using below code.

PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
    List allPages = pddDocument.getDocumentCatalog().getAllPages();
    for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
    continue;
    }
    System.out.println("Total annotations = " + la.size());
    System.out.println("\nProcess Page " + pageNum + "...");
    // Just get the first annotation for testing
    PDAnnotation pdfAnnot = la.get(0); 
    System.out.println("Getting text from comment = " + pdfAnnot.getContents());

Now I need to get the highlighted text, any code example will be highly appreciated.

[This question](http://stackoverflow.com/q/32608083/1729265) might help. — mkl, Oct 21 '15 at 14:45
@mkl thanks a lot for the link it really helped me. But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts? — Abid Khan, Oct 22 '15 at 04:17

score 9 · Answer 1 · answered Aug 13 '16 at 05:19

I Hope this answer help everyone who is facing the same problem.

// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // this is the in-memory representation of the PDF document.
    // this will load a document from a file.
    PDDocument document = PDDocument.load(filePath);
    // this represents all pages in a PDF document.
    List<PDPage> allPages =  document.getDocumentCatalog().getAllPages();
    // this represents a single page in a PDF document.
    PDPage page = allPages.get(pageNumber);
    // get  annotation dictionaries
    List<PDAnnotation> annotations = page.getAnnotations();

    for(int i=0; i<annotations.size(); i++) {
        // check subType 
        if(annotations.get(i).getSubtype().equals("Highlight")) {
            // extract highlighted text
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();

            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;

            for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {

                COSFloat ULX = (COSFloat) quadsArray.get(0+k);
                COSFloat ULY = (COSFloat) quadsArray.get(1+k);
                COSFloat URX = (COSFloat) quadsArray.get(2+k);
                COSFloat URY = (COSFloat) quadsArray.get(3+k);
                COSFloat LLX = (COSFloat) quadsArray.get(4+k);
                COSFloat LLY = (COSFloat) quadsArray.get(5+k);
                COSFloat LRX = (COSFloat) quadsArray.get(6+k);
                COSFloat LRY = (COSFloat) quadsArray.get(7+k);

                k+=8;

                float ulx = ULX.floatValue() - 1;                           // upper left x.
                float uly = ULY.floatValue();                               // upper left y.
                float width = URX.floatValue() - LLX.floatValue();          // calculated by upperRightX - lowerLeftX.
                float height = URY.floatValue() - LLY.floatValue();         // calculated by upperRightY - lowerLeftY.

                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");

                if(j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();

    return highlightedTexts;
}

Thanks a lot for your example! Here are the changes to make this work with PDFBox 2.0: PDDocument document = PDDocument.load(new File(filePath)); List allPages = new ArrayList<>(); document.getDocumentCatalog().getPages().forEach(allPages::add); PDPage page = allPages.get(pageNumber); COSArray quadsArray = (COSArray) annotations.get(i).getCOSObject().getDictionaryObject(COSName.getPDFName("QuadPoints")); — dev-random, Sep 27 '16 at 08:59

score 7 · Accepted Answer · edited Jun 20 '20 at 09:12

The code in the question Not able to read the exact text highlighted across the lines already illustrates most concepts to use for extracting text from limited content regions on a page with PDFBox.

Having studied this code, the OP still wondered in a comment:

But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts?

In general the area an annotation refers to is a rectangle:

Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.

(from Table 164 – Entries common to all annotation dictionaries - in ISO 32000-1)

For some annotations types (e.g. text markups), this location value does not suffice because:

text to markup may be written at some odd angle but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
text to markup may start anywhere in a line and end anywhere in another one, so the markup area is not rectangular at all but it is the union of multiple rectangular parts.

To cope with such annotation types, therefore, the PDF specification provides a more generic way to define areas:

QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order

x₁ y₁ x₂ y₂ x₃ y₃ x₄ y₄

specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x₁, y₁) and (x₂, y₂).

(from Table 179 – Additional entries specific to text markup annotations - in ISO 32000-1)

Thus, instead of the rectangle given by

PDRectangle rect = pdfAnnot.getRectangle();

in the code in the referenced question, you have to consider the quadrilaterals given by

COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName getPDFName("QuadPoints"));

and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately PDFTextStripperByArea.addRegion expects a rectangle as parameter, not some generic quadrilateral. As text usually is printed horizontally or vertically, that should not pose too big a problem.

PS One warning concerning the specification of the QuadPoints, the order may differ in real-life PDFs, cf. the question PDF Spec vs Acrobat creation (QuadPoints).

Java: Apache PDFbox Extract highlighted text

2 Answers2

Linked