Extracting data by table/header using PDFBox

Question

I am trying to extract data from a PDF by header/table. I am not sure whether the data in the PDF counts as a header or a table. I tried to check whether the PDF contains any metadata, but it comes back as null.

Here are the PDF examples:

I want to get the list of charges, and the amount for each one, from the Summary Of Charges header.

In the Summary Of Charges, each charge has an amount and a quantity (qty).

This is my code for the metadata, in which the metadata variable comes out as null:

PDDocument document = PDDocument.load(new File("invoice.pdf"));

    if(!document.isEncrypted()) {
        PDFTextStripper stripper  = new PDFTextStripper();
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);
        
        PDDocumentInformation info = document.getDocumentInformation();
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        // getMetadata() returns null when the catalog has no XMP metadata stream
        PDMetadata metadata = catalog.getMetadata();
        if (metadata != null) {
            InputStream stream = metadata.createInputStream();
        }
    }
    document.close();

And this code basically just gives me chunks of text:

PDDocument document = PDDocument.load(new File("invoice.pdf"));

    if(!document.isEncrypted()) {
        PDFTextStripper stripper  = new PDFTextStripper();
        stripper.setSortByPosition(true);
        String text = stripper.getText(document);
        //InputStream stream = metadata.createInputStream();
        System.out.print("Text:" + text);
    }
    document.close();

Unless the PDF is properly tagged, there is no indication in the PDF about its structure, in particular about which content is a table. Unfortunately, tagging information in a PDF is optional. So, are your PDFs tagged properly? — mkl, Oct 28 '22 at 06:34
I am not sure if my PDF is properly tagged. Is there any way to check? — poojay, Oct 28 '22 at 06:43
You can try using the code from the [ExtractMarkedContent](https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/test/java/mkl/testarea/pdfbox2/extract/ExtractMarkedContent.java) test, e.g. test method `testExtractTestWPhromma` with helper method `showContent`. If that code outputs something usable, then your PDF seems to be tagged in a usable way. That was code for [this Stack Overflow answer](https://stackoverflow.com/a/54983991/1729265) in the section titled "Extraction of content with tags". — mkl, Oct 28 '22 at 09:00
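
If the PDF turns out not to be tagged, one fallback is to post-process the plain text that `stripper.getText(document)` already returns. The following is only a rough sketch using the Java standard library, not a PDFBox feature. The class name `SummaryOfChargesParser`, the `Charge` type, and above all the row layout (`<description> <qty> <amount>`, one charge per line, directly under a line reading "Summary Of Charges") are assumptions made for illustration; the real invoice layout is not shown here, so the pattern would have to be adapted to it.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls charge rows out of text produced by PDFTextStripper.
public class SummaryOfChargesParser {

    /** One parsed row of the summary. */
    public static final class Charge {
        public final String description;
        public final int qty;
        public final BigDecimal amount;

        Charge(String description, int qty, BigDecimal amount) {
            this.description = description;
            this.qty = qty;
            this.amount = amount;
        }
    }

    // Assumed row layout: "<description> <qty> <amount>", e.g. "Monthly fee 2 19.90".
    private static final Pattern ROW =
            Pattern.compile("^(.+?)\\s+(\\d+)\\s+([\\d,]+\\.\\d{2})$");

    public static List<Charge> parse(String text) {
        List<Charge> charges = new ArrayList<>();
        boolean inSection = false;
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (!inSection) {
                // Skip everything until the (assumed) section heading is reached.
                inSection = line.equalsIgnoreCase("Summary Of Charges");
                continue;
            }
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                charges.add(new Charge(
                        m.group(1),
                        Integer.parseInt(m.group(2)),
                        new BigDecimal(m.group(3).replace(",", ""))));
            } else if (!charges.isEmpty()) {
                // The first non-matching line after at least one row ends the section.
                break;
            }
        }
        return charges;
    }

    public static void main(String[] args) {
        // Made-up sample text; a real run would pass the stripper output instead.
        String sample = "Invoice 123\nSummary Of Charges\nDescription Qty Amount\n"
                + "Monthly fee 2 19.90\nRoaming data 10 1,250.00\nTotal 1,269.90\n";
        for (Charge c : parse(sample)) {
            System.out.println(c.description + " | " + c.qty + " | " + c.amount);
        }
    }
}
```

This kind of heuristic depends on `setSortByPosition(true)` keeping each table row on a single line of text, and it breaks as soon as a description wraps or a column is empty, so it is a workaround rather than real table detection.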
