How to get the underlined text from a PDF file?

Question

Hi everyone! I am trying to extract underlined text from a PDF file with iText, and it is proving very difficult. I have searched for a solution for a long time and have learned how to get a text's font family, font size, and location, but not whether it is underlined. Looking forward to your help. Thank you!

From the PDF perspective, an "underline" is literally just a line that happens to be near text but is in no way related to it. If you wanted to get an underline you'd have to look for every line (or possibly rectangle or worse) and compare that to text positions. You (probably) can filter on all lines that have the same `y` coordinate at least. There does exist a possible entry if the PDF is tagged but I don't know if anyone uses that. I would say this is a rather complicated subject. — Chris Haas, Jul 12 '16 at 16:55
@ChrisHaas said it well. More simply put: there is no such thing as "underlined text" in PDF. There is text (maybe, as it could even be images) and lines. You have an impossible task. — Kevin Brown, Jul 13 '16 at 06:42
I think you're right. When I open the PDF file with Adobe Acrobat, not all the underlines are recognized; some are treated as line graphics. It's really strange. — 詹海坤, Jul 13 '16 at 15:18

score 0 · Answer 1 · answered Jun 18 '20 at 10:18

It might not be possible with iText, but you can achieve this with PDFBox to some extent.

Look at this answer: https://stackoverflow.com/a/40039407/4353762

But beware: it might not work in some cases. The library needs to know the font and the font's descriptor. If you pass in a PDF with an unknown font type, the descriptor will be null and the code will simply break with a NullPointerException.

If you want to handle the NullPointerException yourself, you might need to look at the underlines and strikeThrough methods of PDFStyledTextStripper.java.

*"It might not be possible with itext, but you can achieve this with pdfbox **at some extent**"* - actually it is similarly possible with either library as the data extracted for text and paths essentially is the same for both libraries, merely presented a bit differently. — mkl, Jun 18 '20 at 10:44
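The matching step described in the comments (compare every thin horizontal line or rectangle against the text positions) can be sketched without either library. This is only a rough illustration: `TextChunk` and `PathBox` are hypothetical holder types that you would fill from whatever your iText or PDFBox extraction gives you, and the tolerance factors (0.15, 0.3, 0.8) are guesses that will probably need tuning per document. It also assumes unrotated text and PDF user space, where `y` grows upward.

```java
import java.util.ArrayList;
import java.util.List;

final class UnderlineMatcher {

    /** Hypothetical holder for one extracted text chunk. */
    static final class TextChunk {
        final String text;
        final float xStart, xEnd, baselineY, fontSize;

        TextChunk(String text, float xStart, float xEnd, float baselineY, float fontSize) {
            this.text = text;
            this.xStart = xStart;
            this.xEnd = xEnd;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Hypothetical bounding box of a stroked line or a thin filled rectangle. */
    static final class PathBox {
        final float x1, y1, x2, y2; // lower-left and upper-right corners

        PathBox(float x1, float y1, float x2, float y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /** True if the path looks like an underline for the given chunk. */
    static boolean isUnderline(TextChunk t, PathBox p) {
        float height = p.y2 - p.y1;
        float width = p.x2 - p.x1;
        // Must be horizontal and thin compared to the font size (assumed factor).
        if (width <= 0 || height > 0.15f * t.fontSize) {
            return false;
        }
        // Must sit at or slightly below the baseline; a strikethrough sits above it.
        float midY = (p.y1 + p.y2) / 2;
        if (midY > t.baselineY || midY < t.baselineY - 0.3f * t.fontSize) {
            return false;
        }
        // Must cover most of the chunk horizontally (assumed 80 percent).
        float overlap = Math.min(t.xEnd, p.x2) - Math.max(t.xStart, p.x1);
        return overlap >= 0.8f * (t.xEnd - t.xStart);
    }

    /** Returns the text of every chunk that has at least one matching path. */
    static List<String> underlinedText(List<TextChunk> chunks, List<PathBox> paths) {
        List<String> result = new ArrayList<>();
        for (TextChunk t : chunks) {
            for (PathBox p : paths) {
                if (isUnderline(t, p)) {
                    result.add(t.text);
                    break; // one matching path is enough
                }
            }
        }
        return result;
    }
}
```

The hard part is still collecting the inputs: text chunks with baselines from the text extraction side, and path bounding boxes from the graphics side. One long line under several words will also fail the 80 percent test per chunk only if the chunks are wider than the line, so a single rule under a whole sentence should still match each word.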
