I am trying to extract data from a PDF by header or table. I am not sure whether the data in the PDF counts as a header or a table. I checked whether the PDF contains any metadata, but it comes back as null.
Here are the PDF examples:
I want to get the list of charges, and the amount for each one, from the Summary Of Charges header.
Under the Summary Of Charges header, each charge has an amount and a qty.
This is my code for the metadata, where the metadata variable comes out as null:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    String text = stripper.getText(document);
    PDDocumentInformation info = document.getDocumentInformation();
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    // getMetadata() returns null when the PDF has no XMP metadata stream
    PDMetadata metadata = catalog.getMetadata();
    if (metadata != null) {
        // Guard against the null case so this does not throw a NullPointerException
        InputStream stream = metadata.createInputStream();
    }
}
document.close();
And this code basically just gives me chunks of text:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    String text = stripper.getText(document);
    //InputStream stream = metadata.createInputStream();
    System.out.print("Text:" + text);
}
document.close();
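
To make the goal concrete, here is a minimal sketch of the kind of result I am after. It only post-processes the stripped text with plain Java and does not use any PDFBox table API. It assumes that each charge ends up on its own line as description, qty, amount once `setSortByPosition(true)` is applied. The sample text, the row pattern, and the `SummaryOfChargesSketch`, `Charge`, and `parse` names are made up for illustration and are not taken from my real invoice, so the pattern would probably need adjusting.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SummaryOfChargesSketch {

    // Assumed (hypothetical) row layout: "<description> <qty> <amount>",
    // e.g. "Freight Charge 2 1,150.00". The real invoice may differ.
    private static final Pattern ROW =
            Pattern.compile("^(.+?)\\s+(\\d+)\\s+([\\d,]+\\.\\d{2})$");

    static final class Charge {
        final String description;
        final int qty;
        final BigDecimal amount;

        Charge(String description, int qty, BigDecimal amount) {
            this.description = description;
            this.qty = qty;
            this.amount = amount;
        }
    }

    // Scans the text returned by PDFTextStripper.getText(...) line by line.
    static List<Charge> parse(String text) {
        List<Charge> charges = new ArrayList<>();
        boolean inSummary = false;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.toLowerCase().contains("summary of charges")) {
                inSummary = true;          // rows start after this header
                continue;
            }
            if (!inSummary) {
                continue;                  // ignore everything before the header
            }
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                charges.add(new Charge(
                        m.group(1),
                        Integer.parseInt(m.group(2)),
                        new BigDecimal(m.group(3).replace(",", ""))));
            } else if (!charges.isEmpty()) {
                break;                     // first non-row line after the rows ends the section
            }
        }
        return charges;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the stripped PDF text.
        String text = String.join("\n",
                "Invoice No 12345",
                "Summary Of Charges",
                "Description Qty Amount",
                "Freight Charge 2 1,150.00",
                "Handling Fee 1 25.50",
                "Total 1,175.50");
        for (Charge c : parse(text)) {
            System.out.println(c.description + " | qty=" + c.qty + " | amount=" + c.amount);
        }
    }
}
```

In my real code the `text` argument would be the string returned by `stripper.getText(document)`. This only works if the columns come out in a stable order on one line, which is exactly what I am unsure about.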