PDFBox: convert PDF to text including chapter headline information

Asked Nov 20 '16 at 13:21

Active Nov 20 '16 at 13:21

Viewed 784 times

I am currently working on a project that extracts the content of PDF files and searches it for certain keywords. I am using PDFBox for the extraction, and this works fine. The problem I have now run into is that I want to search for certain keywords only within chapter headlines.

At the moment my code for extracting looks like this:

// try-with-resources closes the document even if text extraction throws
String text;
try (PDDocument doc = PDDocument.load(pdfFile)) {
    text = new PDFTextStripper().getText(doc);
}

This only extracts the raw text of the file, with no information about headlines. I was not able to figure out how to use PDFBox to include such information. So I am not sure if this is even possible.

Does anybody have experience with this tool and can tell me whether it is even possible to do this with PDFBox and, if so, how I can achieve it?

Kind regards

edited Jun 20 '20 at 09:12

Community

asked Nov 20 '16 at 13:21

sandra.punkt

How can those headlines be recognised? More often than not, PDFs are not tagged and therefore contain no information marking certain pieces of text as headers. So there must be some other criteria, e.g. special font types, font sizes, etc. Which is it in your case? – mkl Nov 20 '16 at 13:52
In my case the headlines are centered, italic and bold but typically have the same size as the paragraph text. – sandra.punkt Nov 20 '16 at 17:08
If italic and bold effects are achieved by the use of italic and bold fonts, [this answer](http://stackoverflow.com/a/40039407/1729265) might show you the way. – mkl Nov 20 '16 at 17:55
Thank you, that helped a lot! :) – sandra.punkt Nov 20 '16 at 18:45
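
For illustration, the criteria discussed in the comments (bold and italic font, centered line) might be checked roughly as follows. This is only a sketch under stated assumptions, not the code from the linked answer: it assumes the style is encoded in the font name (for example "TimesNewRoman-BoldItalic"), and the class and method names (`HeadlineFontHeuristic`, `baseFontName`, `looksBoldItalic`, `looksCentered`) as well as the tolerance value are hypothetical. It uses only the Java standard library, so it runs without PDFBox.

```java
import java.util.Locale;

// Hypothetical helper: decides whether a line of text looks like a headline,
// based on the font name and the horizontal position of the line.
final class HeadlineFontHeuristic {

    // Embedded font subsets usually carry a six-letter prefix such as "ABCDEF+".
    // Strip it so only the real font name is inspected.
    public static String baseFontName(String fontName) {
        if (fontName == null) {
            return "";
        }
        int plus = fontName.indexOf('+');
        return (plus == 6) ? fontName.substring(plus + 1) : fontName;
    }

    // Assumption: bold and italic are achieved by using a bold-italic font
    // whose name says so. Synthetic bold or italic would not be detected.
    public static boolean looksBoldItalic(String fontName) {
        String name = baseFontName(fontName).toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        return bold && italic;
    }

    // A line counts as centered when its midpoint is within `tolerance`
    // (same unit as the coordinates) of the middle of the page.
    public static boolean looksCentered(float lineStartX, float lineEndX,
                                        float pageWidth, float tolerance) {
        float lineCenter = (lineStartX + lineEndX) / 2f;
        return Math.abs(lineCenter - pageWidth / 2f) <= tolerance;
    }

    public static void main(String[] args) {
        System.out.println(looksBoldItalic("ABCDEF+TimesNewRoman-BoldItalic")); // true
        System.out.println(looksBoldItalic("Helvetica-Bold"));                  // false
        System.out.println(looksCentered(200f, 400f, 600f, 5f));                // true
    }
}
```

With PDFBox 2.x, such checks could presumably be called from a `PDFTextStripper` subclass that overrides `writeString(String, List<TextPosition>)` and reads the font name via `TextPosition.getFont().getName()`. Whether the font names in a given PDF actually carry style information has to be verified for that file.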

0 Answers