PDFBox: Detecting the highlighted text in a given page

Question

PDFBox version 2.0.20

I'm trying to detect the highlighted text (the text under the black boxes on pages 5, 6, 7, and 9) in the following PDF:

https://www.courthousenews.com/wp-content/uploads/2019/01/Manafort-response.pdf

I tried the solution proposed in this comment, with no luck. For example, page.getAnnotations() returns an empty list. Any idea how to detect them?

K J · Accepted Answer · 2021-10-13T14:45:15.040

No need to detect them: the original text is still there. This is a classic case of redaction failure, and it does not matter whether the highlight is opaque black or see-through yellow. Just copy and paste, or export the pages as plain text.

There is no direct relationship between the black rectangles (drawn as "paths") and the text below them. They are independent objects on the page. Only good downstream processing could marry them together.

The zone of interest is a region made up of multiple rectangles with ragged edges. You would have to match any text that lies within or overlaps that zone, using some adjustable rule for text that is partly inside and partly outside, which is also why redaction so commonly fails. It sounds like one big challenge that requires lots and lots of honing.
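As a rough illustration of that matching step only, here is a minimal sketch using plain JDK geometry. It assumes you have already collected the black rectangles and the text bounding boxes from the page, and that both are in the same coordinate space. How you collect them with PDFBox is not shown here (presumably via subclasses of PDFGraphicsStreamEngine and PDFTextStripper, but that is an assumption). `RedactionMatcher`, `TextBox`, and the `minCoverage` threshold are hypothetical names introduced for this example, and the coverage sum assumes the black rectangles do not overlap each other.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RedactionMatcher {

    /** Hypothetical holder: one glyph or word plus its bounding box in page coordinates. */
    public static final class TextBox {
        public final String text;
        public final Rectangle2D bounds;

        public TextBox(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /**
     * Fraction (0..1) of the text box area covered by the black rectangles.
     * Assumes the rectangles do not overlap each other; the result is clamped to 1.
     */
    public static double coveredFraction(Rectangle2D box, List<Rectangle2D> blackRects) {
        double boxArea = box.getWidth() * box.getHeight();
        if (boxArea <= 0) {
            return 0.0;
        }
        double covered = 0.0;
        for (Rectangle2D r : blackRects) {
            Rectangle2D overlap = box.createIntersection(r);
            // A disjoint pair yields a non-positive width or height, so skip it.
            if (overlap.getWidth() > 0 && overlap.getHeight() > 0) {
                covered += overlap.getWidth() * overlap.getHeight();
            }
        }
        return Math.min(1.0, covered / boxArea);
    }

    /** Returns the texts whose boxes are covered by at least minCoverage (0..1). */
    public static List<String> textUnder(List<TextBox> texts,
                                         List<Rectangle2D> blackRects,
                                         double minCoverage) {
        List<String> hits = new ArrayList<>();
        for (TextBox t : texts) {
            if (coveredFraction(t.bounds, blackRects) >= minCoverage) {
                hits.add(t.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two stacked rectangles of different widths: a "ragged" blackout zone.
        List<Rectangle2D> rects = new ArrayList<>();
        rects.add(new Rectangle2D.Double(100, 700, 200, 12));
        rects.add(new Rectangle2D.Double(100, 686, 150, 12));

        List<TextBox> texts = new ArrayList<>();
        texts.add(new TextBox("secret", new Rectangle2D.Double(110, 701, 40, 10))); // fully inside
        texts.add(new TextBox("partly", new Rectangle2D.Double(280, 701, 40, 10))); // half inside
        texts.add(new TextBox("public", new Rectangle2D.Double(320, 701, 40, 10))); // outside

        System.out.println(textUnder(texts, rects, 0.6)); // prints [secret]
    }
}
```

The threshold is the part that needs honing: with `minCoverage` at 0.5 the half-covered word "partly" is reported as well, so the right value depends on how sloppily the boxes were drawn over the text.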

[Later Edit]

The PDFBox team can give advice, and @TilmanHausherr suggested starting by looking at "PDFBox 2.0.2 > Calling of PageDrawer.processPage method caught exceptions".

edited Oct 13 '21 at 14:45

answered Oct 13 '21 at 02:20

K J


Thanks for the reply, @K J. I know that I can select & copy them, but I want to detect which text is highlighted programmatically using PDFBox. – Ahmad AlMughrabi Oct 13 '21 at 02:24
That sounds like a good plan. Would you mind sharing an example of how to detect the blackout rectangles using PDFBox? – Ahmad AlMughrabi Oct 13 '21 at 02:29
Yes, in my case, I want to use libraries to detect them because I have a large set of PDFs to work with :( But thanks for the descriptive answer, @K J! it is really useful for the manual approach! :) – Ahmad AlMughrabi Oct 13 '21 at 03:08
Good question, I need the highlighted text to group them in a separate data source. We want to use them in a different place in the UI. – Ahmad AlMughrabi Oct 13 '21 at 03:15
https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions – Tilman Hausherr Oct 13 '21 at 03:21
@K J, sorry for the late reply. Yes, you're right. TilmanHausherr shared a great answer :D Would you mind editing your post and referring to @TilmanHausherr's answer so that I can mark your answer as complete and correct? – Ahmad AlMughrabi Oct 13 '21 at 14:40
Yes, looks great, thanks a lot! – Ahmad AlMughrabi Oct 13 '21 at 14:53
