I am trying to parse a table in a PDF file and output it as CSV. I have attached sample data from the PDF below (only a few columns) and the sample output for it. Each column has a fixed width, say Company Name (18 chars), Amount (8 chars), Type (5 chars), etc. I tried using the iText and PDFBox jars to get each page's data and parse it line by line, but that does not seem to be a good solution because the line breaks and page breaks in the PDF are not reliable. Please let me know if there is a more appropriate solution. We want to use open source software for this.
How should a parser do that? Without your sample output I wouldn't have known myself which lines belong together... If your pdf is properly tagged, you might be in luck. If you shared an example pdf file instead of an image, we could look for further clues inside. – mkl Aug 10 '16 at 05:01
Please limit your question to either iText or Pdfbox. If needed, create 2 separate questions, one for iText and one for Pdfbox. Share your pdf. Share your code. Do not ask for a shopping list of pdf libraries, that is not allowed on StackOverflow. You need to try them out yourself first, and then ask questions when you get stuck. – Amedee Van Gasse Aug 10 '16 at 05:44
Hi mkl, Thank you for addressing my question. Since the number of columns, column width, the maximum number of lines a single record/tuple will span is fixed, I thought we can parse it. Also if company name spans for a max of 3 lines, amount and seller columns may span only 1 line, then the other 2 lines will be blank in amount and type columns. Very Sorry I cannot send the PDF :( – user6404269 Aug 10 '16 at 06:23
Hi Amedee Van Gasse, I have tested both iText and PDFBox for this, and each has its own limitations. Since I used both of them for the same problem, I tagged both. I am not asking for a shopping list here. I tried 2 solutions which were not feasible, so I just wanted to check if anyone has a better approach (for example PDF -> HTML/TXT file -> CSV, or any other better PDF parser). – user6404269 Aug 10 '16 at 06:28
My question to share a pdf still stands. mkl also asked this. Without pdf, it's guesswork. – Amedee Van Gasse Aug 10 '16 at 07:49
1 Answer
This is a very complex problem. There are even multiple master's theses about it.
An easy analogy: I have 5000 puzzle-pieces, all of them are perfectly square and could fit anywhere. Some of them have pieces of lines on them, some of them have snippets of text.
However, that does not mean it can't be done. It'll just take work.
General approach:
- use iText (specifically IEventListener) to get information on all rendering events for every page
- select those rendering events that make sense for your application, for example PathRenderInfo and TextRenderInfo.
- Events in a pdf do not need to appear in order according to the spec. Solve this problem by implementing a comparator over IEventData. This comparator should sort according to reading order. This implies you might have to implement some basic language detection, since not every language reads left-to-right.
- Once sorted, you can now start clustering items together according to any of the various heuristics you find in literature. For instance, two characters can be grouped into a snippet of text if they follow each other in the sorted list of events (meaning they appear next to each other in reading order), if the y-position does not differ too much (subscript and superscript might screw with this), and if the x-position does not differ too much (kerning).
- Continue clustering characters until you have formed words
- Assuming you have formed words, use a similar algorithm to group words into lines. Use PathRenderInfo to avoid merging words when a drawn line runs between them.
- Assuming you have managed to create lines, now look for tables. One possible approach is to apply a horizontal and a vertical projection, and look for those sub-areas of the page that (when projected) show a grid-like structure.
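To make the sorting, clustering and projection steps a bit more concrete, here is a minimal sketch in plain Java. It does not call iText at all: `Glyph` is a hypothetical stand-in for whatever text and bounding-box data you copy out of each text rendering event, and `lineHeight`, `maxGap`, `pageWidth` and `minGap` are assumed tuning values you would have to pick for your own documents. It only handles left-to-right text and ignores drawn lines, so treat it as an illustration of the idea rather than a working table extractor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TableSketch {

    /** Hypothetical stand-in for one text rendering event: a glyph and its position. */
    static final class Glyph {
        final String text;
        final double x, y, width;

        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Reading order for left-to-right text in PDF coordinates (y grows upward):
     * top line first, then left to right. y is bucketed by lineHeight so that
     * small baseline differences do not reorder glyphs on the same line.
     */
    static Comparator<Glyph> readingOrder(double lineHeight) {
        return Comparator
                .comparingLong((Glyph g) -> -Math.round(g.y / lineHeight))
                .thenComparingDouble(g -> g.x);
    }

    /** Groups glyphs into words: same line and horizontal gap of at most maxGap. */
    static List<String> clusterWords(List<Glyph> glyphs, double lineHeight, double maxGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(readingOrder(lineHeight));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null) {
                boolean sameLine = Math.abs(g.y - prev.y) < lineHeight / 2;
                boolean adjacent = g.x - (prev.x + prev.width) <= maxGap;
                if (!(sameLine && adjacent)) {
                    words.add(current.toString()); // close the previous word
                    current.setLength(0);
                }
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    /**
     * Vertical projection: marks every x position covered by some glyph and
     * returns the x positions where a covered run starts after at least
     * minGap empty positions. These are candidate column starts.
     */
    static List<Integer> columnStarts(List<Glyph> glyphs, int pageWidth, int minGap) {
        boolean[] covered = new boolean[pageWidth];
        for (Glyph g : glyphs) {
            int from = Math.max(0, (int) Math.floor(g.x));
            int to = Math.min(pageWidth, (int) Math.ceil(g.x + g.width));
            for (int i = from; i < to; i++) {
                covered[i] = true;
            }
        }
        List<Integer> starts = new ArrayList<>();
        int emptyRun = minGap; // treat the left page edge as a gap
        for (int x = 0; x < pageWidth; x++) {
            if (covered[x]) {
                if (emptyRun >= minGap) {
                    starts.add(x);
                }
                emptyRun = 0;
            } else {
                emptyRun++;
            }
        }
        return starts;
    }
}
```

Bucketing y keeps the comparator consistent, but two glyphs of the same line can still land in different buckets near a bucket boundary, so a real implementation would probably cluster baselines first. The same projection applied to the y-axis gives candidate row boundaries.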
This high-level approach should make it painfully obvious why this is not a widely available thing. It's very hard to implement. It requires domain knowledge of PDF, fonts, and machine learning.
If you are ok with commercial solutions, try out pdf2Data. It's an iText add-on that features this exact functionality.

Joris Schellekens