
How can I extract specific information from a PDF file and save it as a table in an .xlsx file? For example, I want to take specific values from a company report, such as the release date of the file, the revenue, the loss of profit and the financial gain, and update them regularly in the same .xlsx table.

The code below is from another question here, but I don't know what I need to add or whether it does what I need.

    import java.awt.geom.Rectangle2D;
    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripperByArea;

    public class ExtractText {

        // Usage: xxx.jar filepath page x y width height
        public static void main(String[] args) throws IOException {

            if (args.length != 6) {
                System.out.println("Usage: xxx.jar filepath page x y width height");
                return;
            }

            // Parameters
            String filepath = args[0];
            int page = Integer.parseInt(args[1]); // zero-based page index
            int x = Integer.parseInt(args[2]);
            int y = Integer.parseInt(args[3]);
            int width = Integer.parseInt(args[4]);
            int height = Integer.parseInt(args[5]);

            // PDDocument.load is the PDFBox 2.x API; try-with-resources closes the file
            try (PDDocument document = PDDocument.load(new File(filepath))) {

                // Register the rectangle to extract under the name "region"
                PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
                Rectangle2D rect = new Rectangle2D.Float(x, y, width, height);
                textStripper.addRegion("region", rect);

                PDPage docPage = document.getPage(page);
                textStripper.extractRegions(docPage);

                // Print the text found inside the rectangle
                String textForRegion = textStripper.getTextForRegion("region");
                System.out.println(textForRegion);
            }
        }
    }



  • How do you intend to identify that *specific information* in the document? The code you posted extracts text from a rectangle given by its coordinates, but that obviously only works if you know the coordinates of the information in question... – mkl Jan 20 '20 at 10:09
  • I'm new to programming and this code was the nearest I could find that reaches my expectations. I intend to get specific information from a text and to put that in a table. – Arda Ant Öztürk Jan 20 '20 at 10:24
  • How do you intend to identify that *specific information* in the document? Is it in Form fields? Then use `document.getDocumentCatalog().getAcroForm()`. Is it always at the same position? Then use extraction by coordinates. Is it always preceded by a label not otherwise used in the file? Then extract all the page and look for that label. Etc. etc. etc. – mkl Jan 20 '20 at 11:57
  • I want the "Reference Instrument", "ISIN", "Currency", "Reference Agent", "Type" and the "Valuation Price (current)". They are always in the same position, but where can I see the coordinates? – Arda Ant Öztürk Jan 20 '20 at 13:01
  • There are some visual tools that show you coordinates, e.g. the [PDFBox PDFDebugger](https://stackoverflow.com/a/55100730/1729265). – mkl Jan 20 '20 at 16:43
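The label-based route mentioned in the comments (extract the whole page, then look for each label) could be sketched roughly as follows. This is only a minimal sketch under assumptions: the class name `LabelExtractor` and the method `extractFields` are made up for illustration, the page text is assumed to have already been extracted into a `String` (for instance with PDFBox's `PDFTextStripper.getText(document)`), and each label is assumed to start a line in a "Label: value" or "Label value" layout, which may not match the real report. The sample values are invented. Writing the resulting row to the .xlsx file is not shown.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelExtractor {

    // Labels named in the comments; their exact spelling in the PDF is an assumption.
    static final List<String> LABELS = List.of(
            "Reference Instrument", "ISIN", "Currency",
            "Reference Agent", "Type", "Valuation Price (current)");

    // Returns label -> value for every label that starts a line in pageText.
    // Assumed layout (hypothetical): "Label: value" or "Label value" on one line.
    static Map<String, String> extractFields(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String label : LABELS) {
            // Pattern.quote keeps the parentheses in a label from acting as regex syntax.
            Pattern p = Pattern.compile(
                    "^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:?[ \\t]*(.+?)[ \\t]*$",
                    Pattern.MULTILINE);
            Matcher m = p.matcher(pageText);
            if (m.find()) {
                fields.put(label, m.group(1)); // first occurrence wins
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Invented sample text standing in for the extracted page content.
        String sample = String.join("\n",
                "Reference Instrument: Example AG",
                "ISIN: XX0000000000",
                "Currency EUR",
                "Valuation Price (current): 12.34");

        extractFields(sample).forEach((label, value) ->
                System.out.println(label + " = " + value));
    }
}
```

Labels that do not appear in the text are simply left out of the map, so the caller can decide whether a missing field is an error. If the values always sit at fixed positions, as described in the comments, extraction by coordinates is probably the more robust choice and this parsing step would not be needed.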
