Underlines are filled rectangles, and they are part of the graphic content of a PDF file. So far I can detect them as PDF operators with op.getName().equals("f"), where "f" is the fill operator that paints the underline (a filled rectangle) and op is an org.apache.pdfbox.contentstream.operator.Operator. I get the tokens from the PDF using PDFStreamParser and then search for these operators. But I need some properties of those rectangles, not only their existence, because I need to find out how many characters are underlined. Does anyone know whether this is possible with this approach? I've seen mentions of PDPageDrawer and/or PDFGraphicsStreamEngine. Thanks for any help.
How to find underlines (length of the underline, coordinates, or something like that) in PDFBox 2.0?
-
`Underlines are filled rectangles` no, they're not. Or not always. A line could just be a line. Have a look at https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions/38933039#38933039 or https://stackoverflow.com/questions/35409283/how-to-find-table-border-lines-in-pdf-using-pdfbox – Tilman Hausherr Sep 23 '16 at 15:19
-
I've seen that post. But in my case, as far as I checked, they are. I will try that method to check whether there are lines too, and not only filled vector rectangles. But I wonder if I can get more information about these rectangles. – Pein Sep 23 '16 at 15:29
-
PDF is a graphics format, so there are many ways to do it... Run the code in the links and look at the output. Btw, to see what's in a content stream, use PDFDebugger (if you aren't already). – Tilman Hausherr Sep 23 '16 at 15:34
-
No, I didn't use PDFDebugger. I will try those. Thanks for help. – Pein Sep 23 '16 at 15:40
-
I used the first method and figured out that in the appendRectangle method I can get the bounding box of each filled rectangle on the page, and in the strokePath method I can get the bounding box of each drawn line on the page. The width of the line is what interests me, but I think it is hard to know how many characters are above that line... – Pein Sep 29 '16 at 12:26
-
But I think I have an idea of how to do it. I know the bounds of the line and the position of every character. I iterate through all characters; when I find the first one that sits above a line, I keep going through the following ones until I reach the character positioned at the end of the (under)line. Then I repeat the process. To decide whether a character is above a line, I take the difference between the Y of the character and the Y of the line, and it has to be small: less than 1-2, maybe 3 or 4. It might work. – Pein Sep 29 '16 at 13:20
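The heuristic in the comment above could be sketched roughly as follows. This is only an illustration using `java.awt.geom`, not PDFBox code: the `Glyph` holder, the `underlined` method and the `maxGap` tolerance are hypothetical names, and it assumes that glyph and line boxes are already in the same coordinate space with y growing upward (how the glyph boxes are obtained, for example from a text stripper, is left out).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class UnderlineMatcher {

    /** Hypothetical holder for one character and its bounding box (y grows upward). */
    static final class Glyph {
        final String text;
        final Rectangle2D box;

        Glyph(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /**
     * Returns the glyphs that sit directly above the given line:
     * the bottom of the glyph box is within maxGap of the top of the line,
     * and the horizontal centre of the glyph lies inside the line's extent.
     */
    static List<Glyph> underlined(List<Glyph> glyphs, Rectangle2D line, double maxGap) {
        List<Glyph> hits = new ArrayList<>();
        for (Glyph g : glyphs) {
            // Vertical distance between glyph bottom and line top.
            // Descenders can dip below the line, so the absolute value is compared.
            double gap = g.box.getMinY() - line.getMaxY();
            double centerX = g.box.getCenterX();
            boolean closeVertically = Math.abs(gap) <= maxGap;
            boolean insideHorizontally = centerX >= line.getMinX() && centerX <= line.getMaxX();
            if (closeVertically && insideHorizontally) {
                hits.add(g);
            }
        }
        return hits;
    }
}
```

Calling `underlined(glyphs, lineBounds, 3.0).size()` once per detected line would give the number of underlined characters for that line. Testing the glyph's horizontal centre rather than its full box is a design choice that tolerates lines ending slightly inside the last character; the tolerance value itself would need tuning per document.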
-
I have one more question. I have found one type of line which is recognized neither as a line nor as a rectangle. It's the line that is generated when you type an email link in Word/PDF. Do you have any idea what that is? – Pein Sep 30 '16 at 12:58
-
Could be a link annotation without appearance stream. The current PDFBox version renders these directly, so the trick doesn't work. You'd have to find these yourself (PDPage.getAnnotations() and then check the instance, and if it is a PDAnnotationLink, then get its data) – Tilman Hausherr Sep 30 '16 at 13:03
-
Ok, so an annotation. Thanks!! – Pein Sep 30 '16 at 13:13
-
I've found this strange type of line and I can't get correct information about it, such as its bounds. But it is detected as a rectangle. See here: [link](http://prnt.sc/co7cje) – Pein Sep 30 '16 at 15:06
-
I would like to ignore this line, because it isn't an underline, just a separator. But it corrupts my heuristic. – Pein Sep 30 '16 at 15:15
-
You can get bounds with Shape.getBounds2D(). These lines are complex shapes. Also, the line contains "curveto" segments, so this could also be a trick to ignore them. Call getPathIterator to see what's in a path. – Tilman Hausherr Sep 30 '16 at 17:50
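A minimal sketch of that suggestion, using only the standard `java.awt.geom` classes: walk the path with `getPathIterator`, reject shapes that contain curve segments, and use `getBounds2D()` for the size check. The class name `PathInspector`, the method names and the `maxThickness` threshold are assumptions made for illustration, and the path is assumed to be available as a `java.awt.Shape`.

```java
import java.awt.Shape;
import java.awt.geom.PathIterator;
import java.awt.geom.Rectangle2D;

final class PathInspector {

    /** True if the path contains at least one quadratic or cubic curve segment. */
    static boolean hasCurves(Shape shape) {
        double[] coords = new double[6]; // large enough for a cubic segment
        for (PathIterator it = shape.getPathIterator(null); !it.isDone(); it.next()) {
            int type = it.currentSegment(coords);
            if (type == PathIterator.SEG_CUBICTO || type == PathIterator.SEG_QUADTO) {
                return true;
            }
        }
        return false;
    }

    /**
     * Rough underline test: no curve segments, a thin bounding box,
     * and wider than it is tall. maxThickness is a guess to be tuned.
     */
    static boolean looksLikeUnderline(Shape shape, double maxThickness) {
        Rectangle2D bounds = shape.getBounds2D();
        return !hasCurves(shape)
                && bounds.getHeight() <= maxThickness
                && bounds.getWidth() > bounds.getHeight();
    }
}
```

A decorative separator built from curves would be rejected by `hasCurves` before its bounds are even considered, which is the "trick to ignore them" mentioned above. A separator made only of straight segments would still pass, so this check is a filter and not a complete classifier.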