Can anybody help me extract table data using iText or PDFBox? I have a PDF with 1000 pages; my job is to parse it and store the data in a database.
5
-
If you want to try doing that with iText(Sharp), this thread on the iText mailing list may be of interest to you: [parse tabular data in PDF using iTextSharp](http://itext-general.2136553.n4.nabble.com/parse-tabular-data-in-PDF-using-iTextSharp-tt4657013.html). As @mark said in his answer, though, generic solutions are hit and miss. If your 1000 pages have very uniform tables, a specially tailored extraction routine might be the best way to go. – mkl Jan 15 '13 at 09:26
-
Possible duplicate of [Parsing PDF files (especially with tables) with PDFBox](https://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox) – beldaz Oct 15 '17 at 21:21
2 Answers
4
A PDF does not contain any table structure elements unless it contains additional XML to define the table. Otherwise there is no structure, only text and lines drawn at positions on the page. There is a blog article I wrote on how to find out.
Some tools like PDFBox will make an effort to guess the table, but it can be hit and miss.

Alexis Pigeon

mark stephens
-
Thanks for replying... But our problem is that we have a PDF file containing a record of examination results, which means it has rows and columns. How can we parse that PDF using PDFBox and store the data in a database? – itsvks Jan 15 '13 at 14:37
-
@user1958037 have you meanwhile tried to use PdfBox as proposed by mark or iText along the lines of the mailing list thread I referred to? What problem have you run into? Furthermore, storing data in a database is a different matter altogether, what are your issues there? – mkl Jan 16 '13 at 09:48
1
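You can use this code to extract the data as a string:

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper; // org.apache.pdfbox.text.PDFTextStripper in PDFBox 2.x

    PDDocument document = PDDocument.load(pathToFile); // PDFBox 2.x expects a java.io.File here
    PDFTextStripper s = new PDFTextStripper();
    String content = s.getText(document);
    document.close(); // release the file once the text has been extracted

Then you can use Java regular expressions to parse the text row by row and load the values into your Java POJO beans.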
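To make the regular-expression step concrete, here is a minimal sketch. It assumes each table row comes out of the text stripper as one line of the form `<roll number> <name> <marks>`; that layout, the `ResultRowParser` class and the `ResultRow` bean are hypothetical, so adjust the pattern to whatever your extracted text actually looks like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the row layout "<roll number> <name> <marks>" is an assumption.
class ResultRowParser {

    // Simple POJO for one table row; the field names are illustrative only.
    static final class ResultRow {
        final int rollNumber;
        final String name;
        final int marks;

        ResultRow(int rollNumber, String name, int marks) {
            this.rollNumber = rollNumber;
            this.name = name;
            this.marks = marks;
        }
    }

    // Digits, whitespace, a name (which may contain spaces), whitespace, digits.
    private static final Pattern ROW =
            Pattern.compile("^(\\d+)\\s+(.+?)\\s+(\\d+)$");

    // Parses text such as the string returned by PDFTextStripper.getText(), line by line.
    static List<ResultRow> parse(String content) {
        List<ResultRow> rows = new ArrayList<>();
        for (String line : content.split("\\R")) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                rows.add(new ResultRow(
                        Integer.parseInt(m.group(1)),
                        m.group(2),
                        Integer.parseInt(m.group(3))));
            }
            // Lines that do not match (column headers, page footers) are skipped.
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "Roll No Name Marks\n101 Alice Smith 87\n102 Bob 91\nPage 1 of 1000\n";
        for (ResultRow r : parse(sample)) {
            System.out.println(r.rollNumber + " | " + r.name + " | " + r.marks);
        }
    }
}
```

Skipping non-matching lines keeps headers and footers out of the result, but it also silently drops rows that wrap onto a second line or have an empty cell, so it is worth counting the skipped lines on a few sample pages before running all 1000.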