
I have tried with PDFTextStripperByArea and PDPageContentStream classes to extract the number values from my pdf file. They work fine!

But my requirement is to use the PDFTable or PDFTableExtractor class to read the PDF contents. Can you tell me which Maven dependency and jar file I need in order to use these classes? Please also mention the methods required to get the values from a particular position.

I have another question. Can we extract table-formatted data from a PDF file as it is? I mean the data with rows and columns, including the table lines. If a page contains some text and a table, can we read only the table headers and rows? I have uploaded my page to GitHub: https://github.com/vengat03/My-Workspace/blob/master/Debit_Note.jpg. From that image, I only need the values of Gross premium, GST and Total Payable. Please let me know whether this is possible.

Vengat Joy
  • I have used Apache PDFBox(a free library) for PDF manipulations. idk how relevant it is to you. – zealvault Jan 23 '18 at 12:57
  • Apache PDFBox contains PDPageContentStream and **I have already tried it**. It works good!. But here, I need to use PDFTableExtractor to achieve my requirement. – Vengat Joy Jan 23 '18 at 13:03
  • Are these two classes `PDFTable` and `PDFTableExtractor` related to pdfbox? – mkl Jan 23 '18 at 13:07
  • No. They don't belong to PDFBox – Vengat Joy Jan 23 '18 at 13:11
  • Did the person who told you to use PDFTableExtractor also tell you in what software package this is? I found this https://github.com/thoqbk/traprange . There is a release at https://github.com/thoqbk/traprange/releases . – Tilman Hausherr Jan 23 '18 at 13:21
  • I tried to use that traprange.jar from github to use `PDFTableExtractor` class. But I don't know the way to add *Maven dependency* in pom.xml – Vengat Joy Jan 23 '18 at 13:39
  • Seems he/she hasn't prepared it for maven central. Have a look here: https://stackoverflow.com/questions/8871056/can-i-use-a-github-project-directly-in-maven – Tilman Hausherr Jan 23 '18 at 13:44
  • :( I don't understand from that link. Also, I just have a jar file in my hand I need to put that jar into my project and I am not sure whether it will work or not. Is there any other way to solve this? – Vengat Joy Jan 23 '18 at 13:58
  • As @TilmanHausherr proposes, traprange successfully builds on jitpack: https://jitpack.io/com/github/thoqbk/traprange/master-1.0-g5e23e5c-13/build.log – mkl Jan 23 '18 at 16:20
  • @mkl Will JitPack compile projects using Oracle Java 7? Because I am using Java7 – Vengat Joy Jan 24 '18 at 04:50
  • As you can see in the log file I linked to in my previous comment, a JDK 8 is used. As the project sets 1.7 as source and target versions, though, the JitPack jar likely has been compiled for use with java 7. – mkl Jan 24 '18 at 06:00
  • You wrote in your comment that you were instructed to try `PDFTableExtractor`. In that case, I'd say you shouldn't bother much about learning maven and choosing jdk, but focus on getting it to run somehow to see whether it solves the request or not. So the easiest would be to create a non maven project in your IDE and attach the jar file. Btw there's a tool to extract tables: tabula java. I don't know if it has an API. – Tilman Hausherr Jan 24 '18 at 09:38
  • @Joris I'd doubt that – mkl Jan 24 '18 at 13:32
  • @mkl You doubt what exactly? :p – Joris Schellekens Jan 24 '18 at 13:36
  • That *"PDFTable or PDFTableExtractor"* in the question refers to iText classes. They appear to be from the thoqbk/traprange project on github; that project is based on PDFBox. – mkl Jan 24 '18 at 13:56
  • My bad. Let me undo. – Joris Schellekens Jan 24 '18 at 14:08
  • @mkl I have another doubt. Can we extract the **table formatted data** from PDF file as it is? I meant the data with rows and columns with table lines. If a page contains some text and a table, can we just read only the table headers and the rows? I have uploaded my page here. https://github.com/vengat03/My-Workspace/blob/master/Debit_Note.jpg **Debit_Note.jpg** From that image, _I only need the values of Gross premium, GST and Total Payable_. Please let me know whether it's possible – Vengat Joy Jan 25 '18 at 04:44
  • I have no idea. It's your *requirement to use PDFTable or PDFTableExtractor class to read the pdf contents,* not mine. @Tilman was helpful in finding those classes in the thoqbk/traprange project on github. What remains is definitely your job. – mkl Jan 25 '18 at 05:18
  • If the files are all from the same source and all have the same structure, then you may be able to extract these values by using regular expressions with the standard text extraction. – Tilman Hausherr Jan 25 '18 at 09:06
  • @TilmanHausherr _what if the files differ in structure_? Does it mean that we need to give all possible starting and ending text? – Vengat Joy Jan 25 '18 at 10:52
  • You'll have a hard time. And also if some 0.00 values are displayed as blanks. – Tilman Hausherr Jan 25 '18 at 10:58
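
The regular-expression approach suggested in the comments above can be sketched roughly as follows. This is only an illustration, not a tested solution for the Debit_Note page: it assumes the page text has already been extracted into a `String` (for example with PDFBox's `PDFTextStripper`), and that each label is followed by its amount on the same line. The class name `DebitNoteValues`, the label spellings, and the sample text are hypothetical.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DebitNoteValues {

    // Labels named in the question; their exact spelling in a real PDF is an assumption.
    private static final String[] LABELS = {"Gross Premium", "GST", "Total Payable"};

    /**
     * Returns, for each label found, the first number that follows it on the same line.
     * Labels without a number on their line (e.g. a blank instead of 0.00) are left out.
     */
    public static Map<String, BigDecimal> extract(String text) {
        Map<String, BigDecimal> values = new LinkedHashMap<>();
        for (String label : LABELS) {
            // label, then any non-digit characters on the same line, then an amount like 1,250.00
            Pattern pattern = Pattern.compile(
                    "(?i)\\b" + Pattern.quote(label) + "\\b[^\\d\\r\\n]*?(-?\\d[\\d,]*(?:\\.\\d+)?)");
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                // Strip thousands separators before parsing
                values.put(label, new BigDecimal(matcher.group(1).replace(",", "")));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical extracted text; in practice this would come from the PDF text extraction
        String sample = "Debit Note\n"
                + "Gross Premium   1,250.00\n"
                + "GST : 225.00\n"
                + "Total Payable  1,475.00\n";
        System.out.println(extract(sample));
    }
}
```

As the last comments point out, this only holds up while the documents share one layout. A label whose value is printed as a blank simply produces no entry, and a digit appearing between the label and the amount (such as a tax rate in parentheses) would be picked up instead of the amount.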

2 Answers


First, don't use classes from the com.lowagie packages. That code is old, obsolete, and no longer supported; it belongs to a very early version of iText.

After that early version, a thorough investigation was carried out into the intellectual property rights of all the code (iText has had a lot of contributors). If you use the old code, you may unknowingly be using code for which you do not have the copyright.

Second, if you just want to solve the problem of extracting numbers and tables from a PDF document, have a look at pdf2Data. It's an iText add-on that makes things a lot easier.

It gives you a nice UI in which you can build templates for data extraction. You can then call a single method to match an existing (XML) template against an input PDF document, and you get back a data structure that contains all the information about the match.

http://pdf2data.online/

Joris Schellekens

PDFTable

I have found two PDFTable classes:

com.lowagie.text.pdf.PdfPTable

com.itextpdf.text.pdf.PdfPTable

Documentation for both classes (this may help you learn the methods you need):

https://www.coderanch.com/how-to/javadoc/itext-2.1.7/com/lowagie/text/pdf/PdfPTable.html

http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfPTable.html

If you want to use these classes, you can copy the dependency into your pom.xml from https://mvnrepository.com/artifact/com.itextpdf/itextpdf or
https://mvnrepository.com/artifact/com.lowagie/itext (as noted on that page, this artifact was moved to com.itextpdf).

You can find examples of how to use these classes here:

https://developers.itextpdf.com/examples/itext-action-second-edition/chapter-4

https://www.programcreek.com/java-api-examples/index.php?api=com.lowagie.text.pdf.PdfPTable

yoav
  • I don't think that this is what Vengat wants, he asked for table extraction, but your classes are about creating tables in a PDF. – Tilman Hausherr Jan 23 '18 at 21:00
  • True @TilmanHausherr I need to read values from a pdf file. Those values will be in table format. Since the positions of the desired text keep changing for different files, I can't use the ordinary approaches, i.e. _reading the whole content and splitting out the required content_ or _finding the text by its exact position_. One of my seniors instructed me to try the `PDFTableExtractor` class. I am new to pdf files. – Vengat Joy Jan 24 '18 at 05:22