I am currently working on a project that extracts the content of PDF files and searches it for certain keywords. I use PDFBox for the extraction, and this works fine. The problem I have now run into is that I want to search for certain keywords only within chapter headlines.
At the moment, my extraction code looks like this:
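// Load the PDF; try-with-resources closes the document even if extraction throws.
String text;
try (PDDocument doc = PDDocument.load(pdfFile)) {
    // getText returns plain text only, without font or layout information.
    text = new PDFTextStripper().getText(doc);
}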
This only extracts the raw text of the file, with no information about which lines are headlines. I could not figure out how to get PDFBox to include such information, so I am not sure whether this is possible at all.
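To illustrate the goal, here is a minimal sketch of the search I have in mind. It rests on assumptions: the `Line` class, the `findInHeadlines` method, and the font-size threshold are all hypothetical and not part of PDFBox. It also assumes that every extracted line already comes with the font size it was rendered in, and that headlines are simply the lines set in a larger font than the body text. How to obtain that per-line information from PDFBox is exactly the open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: it assumes each text line already
// comes with the font size it was rendered in.
class HeadlineSearch {

    // One extracted line plus its font size in points (assumed to be available).
    static final class Line {
        final String text;
        final float fontSizePt;

        Line(String text, float fontSizePt) {
            this.text = text;
            this.fontSizePt = fontSizePt;
        }
    }

    // Returns the text of every line that is at least minHeadlineSizePt large
    // and contains the keyword (case-insensitive).
    static List<String> findInHeadlines(List<Line> lines, String keyword, float minHeadlineSizePt) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Line line : lines) {
            boolean looksLikeHeadline = line.fontSizePt >= minHeadlineSizePt;
            if (looksLikeHeadline && line.text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(line.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
                new Line("1 Introduction", 16f),
                new Line("This chapter presents the results of earlier work.", 10f),
                new Line("2 Results", 16f));
        // Only the headline matches; the body line that mentions "results" is ignored.
        System.out.println(findInHeadlines(lines, "results", 14f)); // prints [2 Results]
    }
}
```

A fixed size threshold is only a heuristic. It would miss headlines that differ from the body text by weight rather than size, so it would probably need tuning per document.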
Does anybody have experience with this tool and can tell me whether this is possible with PDFBox and, if so, how to achieve it?
Kind regards