I am using PDFBox (Java) to read a PDF.

It is a long PDF with more than 40 pages, and I need to extract more than 100 elements from each page. Doing it manually using coordinates would take me forever.

Is there a way to get the text of a PDF page row by row, with each empty cell filled with some null value?

When I parse this table, for example: [image]

using the code:

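    // Needs: import org.apache.pdfbox.text.PDFTextStripper;  (PDFBox 2.x package)
    // `document` is an already loaded PDDocument; `LOG` is a logger that supports {} placeholders.
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);  // order text by position on the page, not by content-stream order

    // Restrict extraction to page 30 (page numbers are 1-based).
    stripper.setStartPage(30);
    stripper.setEndPage(30);
    LOG.info("page 30 \n{}", stripper.getText(document));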

I get this:

016         1 300 
030        17 994        41 629        15 712 
042           676           676 

The problem is that when a row contains only one or two values, I can't tell which columns they belong to.

Meriem
  • The PDF I have is static, but does that mean its coordinates stay the same from one computer to another, no matter where I run my code? – Meriem Apr 19 '22 at 01:08
  • Unfortunately, I suspect that you will need to use a combination of coordinates and the stripper method you have above, or one of the methods shown [here](https://stackoverflow.com/a/17395274/1270000). A better solution would be to get the PDF generated with a dash "-" for empty values; otherwise, you could use an image recognition library to work out where the columns are and generate the coordinates on the fly. If all the pages have tables in the same location, then it's a non-issue. – sorifiend Apr 19 '22 at 01:18
  • The proof-of-concept in [this answer](https://stackoverflow.com/a/45842515/1729265) contains a `LayoutTextStripper` which extends the PDFBox text stripper class to work similarly to `pdftotext -layout`; beware, though, that the answer is still based on PDFBox 1.8.x, so some adaptations may be necessary. A similar approach is the `PDFLayoutTextStripper` by JonathanLink, see [here](https://github.com/JonathanLink/PDFLayoutTextStripper). – mkl Apr 19 '22 at 10:34
  • @Meriem Did/Do the provided links help or do you still need some aid? – mkl May 23 '22 at 17:44
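
The coordinate-based approach suggested in the comments could look roughly like the sketch below. It is not a tested solution for this particular PDF, only an illustration of the column-assignment step: each word of a row is placed into the column whose x range contains it, and columns that receive nothing stay `null`. The class `ColumnBinner`, its `Token` holder, and all x values are hypothetical. With PDFBox 2.x, the words and their x positions could be collected by subclassing `PDFTextStripper` and overriding `writeString(String, List<TextPosition>)`, reading `getXDirAdj()` from the first `TextPosition`. The positions are stored in the PDF file itself, so for a static PDF they should not depend on the machine running the code.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnBinner {

    /** Hypothetical holder: a word and the x coordinate of its left edge. */
    public static final class Token {
        final String text;
        final float x;

        public Token(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Places the tokens of one table row into columns.
     * columnStarts holds the left boundary of each column, sorted ascending;
     * column i covers [columnStarts[i], columnStarts[i + 1]).
     * Tokens are assumed to arrive left to right. Empty cells stay null.
     */
    public static String[] toRow(List<Token> tokens, float[] columnStarts) {
        String[] cells = new String[columnStarts.length];
        for (Token t : tokens) {
            int col = -1;
            for (int i = 0; i < columnStarts.length; i++) {
                if (t.x >= columnStarts[i]) {
                    col = i;   // token lies at or right of this boundary
                } else {
                    break;     // boundaries are sorted, so stop here
                }
            }
            if (col < 0) {
                continue;      // left of the first column: ignored
            }
            // Several words in one cell (e.g. "17" and "994") are joined with a space.
            cells[col] = (cells[col] == null) ? t.text : cells[col] + " " + t.text;
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up boundaries and positions, NOT measured from the PDF in the question.
        float[] columnStarts = {50f, 120f, 220f, 320f, 420f};

        List<Token> sparseRow = Arrays.asList(
                new Token("042", 52f), new Token("676", 250f), new Token("676", 450f));
        System.out.println(Arrays.toString(toRow(sparseRow, columnStarts)));
        // prints [042, null, 676, null, 676]

        List<Token> fullerRow = Arrays.asList(
                new Token("030", 52f),
                new Token("17", 150f), new Token("994", 165f),
                new Token("41", 250f), new Token("629", 265f),
                new Token("15", 350f), new Token("712", 365f));
        System.out.println(Arrays.toString(toRow(fullerRow, columnStarts)));
        // prints [030, 17 994, 41 629, 15 712, null]
    }
}
```

The column boundaries still have to be determined once, for example from the header row or the ruling lines. If every page uses the same table layout, as sorifiend notes, the same boundaries can be reused for all pages.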

0 Answers