I have been trying to read a table, with its rows and cells, from a PDF file so that I can get the records in a systematic order. I have searched Google a lot, but I could not find a good way to do this.

So I have two questions:

Q1: Can we read data from a PDF file?
Q2: Can we read data from any cell of a PDF table?

I am using iText for Java to do this.

Please give me an example of how to do this. Thanks.

  • 2
    Seen [this](http://stackoverflow.com/questions/4784825/how-to-read-pdf-files-using-java)? – Benjamin Gruenbaum Apr 13 '15 at 13:19
  • 2
    Before posting a question, you should spend some time on it yourself. If not, people will most likely just downvote your question. Questions that come across as "here is my task, please do all the work for me" won't fly here. – GhostCat Apr 13 '15 at 13:21

1 Answer

2

The answer to both your questions is: It depends.

  • Suppose that you have a ZUGFeRD invoice. In that case, the invoice is a PDF/A-3 document that has an embedded file in the CII XML format. It is very easy to extract this XML and read it to get all the necessary information about the invoice. Embedding or attaching a file that contains the source of the data used to create the PDF, or the same data in a form other than PDF, is a technique that exists to allow exactly what you need.
  • You can extract text from a PDF. This is explained in questions such as "PDF text extraction using iText", but you only get the raw text without formatting. In many cases, a PDF consists of a bunch of text and lines put on a canvas at absolute positions. A word on the page does not know if it's part of a sentence, part of a cell, etc. Unless:
  • If the PDF is a Tagged PDF, then the PDF also contains information about the structure of the content. For instance: the content will contain tags that indicate structures such as tables, table headers, table rows, table cells. If you are talking about Tagged PDFs, then it's possible to extract the text in a structured way.
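
To make the second point concrete: without tags, all an extractor sees is strings at coordinates, so any notion of "row" or "cell" has to be guessed. The following is only a minimal sketch of one such guess, assuming that chunks of the same row share roughly the same y coordinate. `RowGrouper`, `Chunk`, and `toRows` are hypothetical names for illustration, not iText classes, and the heuristic will fail on multi-line cells or rotated text.

```java
import java.util.ArrayList;
import java.util.List;

class RowGrouper {

    /** A piece of text at an absolute page position (hypothetical type, not part of iText). */
    static final class Chunk {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups chunks into rows: a chunk joins the current row when its y is within
     * `tolerance` of the row's first chunk. PDF y coordinates grow upward, so rows
     * are emitted from the highest y to the lowest, each ordered left to right.
     */
    static List<List<String>> toRows(List<Chunk> chunks, float tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> Float.compare(b.y, a.y)); // top of page first

        List<List<Chunk>> rows = new ArrayList<>();
        float rowY = Float.NaN;
        for (Chunk c : sorted) {
            if (rows.isEmpty() || Math.abs(rowY - c.y) > tolerance) {
                rows.add(new ArrayList<>()); // start a new row
                rowY = c.y;
            }
            rows.get(rows.size() - 1).add(c);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Chunk> row : rows) {
            row.sort((a, b) -> Float.compare(a.x, b.x)); // left to right
            List<String> cells = new ArrayList<>();
            for (Chunk c : row) {
                cells.add(c.text);
            }
            result.add(cells);
        }
        return result;
    }
}
```

The coordinates themselves would have to come from whatever extraction library you use; the sketch only shows why the result is a guess rather than a fact stored in the file.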

In the past, we did a project in which we received credit card statements from VISA, MasterCard, AmEx, and so on. We had to extract all the expenses and store them as records in a database. We were able to achieve this because the format of the statements was predictable: all VISA statements are created alike, hence we were able to find the pattern that allowed us to extract the data.
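
This is not the code from that project, but the general idea of matching a predictable pattern in raw extracted text can be sketched as follows. The line layout (`dd/MM/yyyy description amount`) and the names `StatementParser`, `Expense`, and `parse` are assumptions made up for illustration; a real statement would need its own pattern.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StatementParser {

    /** One expense record; the fields reflect an assumed layout, not a real statement format. */
    static final class Expense {
        final String date;
        final String description;
        final BigDecimal amount;

        Expense(String date, String description, BigDecimal amount) {
            this.date = date;
            this.description = description;
            this.amount = amount;
        }
    }

    // Hypothetical line layout, e.g. "13/04/2015 BOOKSHOP GHENT 42.50".
    private static final Pattern LINE =
            Pattern.compile("^(\\d{2}/\\d{2}/\\d{4})\\s+(.+?)\\s+(-?\\d+\\.\\d{2})$");

    /** Keeps only the lines that match the expected pattern; headers and footers are skipped. */
    static List<Expense> parse(String rawText) {
        List<Expense> expenses = new ArrayList<>();
        for (String line : rawText.split("\\R")) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                expenses.add(new Expense(m.group(1), m.group(2), new BigDecimal(m.group(3))));
            }
        }
        return expenses;
    }
}
```

You would pass the raw text extracted from each page to `StatementParser.parse(...)` and store the resulting records. This only works as long as every statement follows the same layout, which is exactly the condition described above.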

It goes without saying that we do not share the code we used to do this. The company that paid us for doing that project would not be pleased.

Bruno Lowagie