How to parse a PDF that contains tabular data using PDFBox

Question

Can anybody help me extract table data using iText or PDFBox? I have a PDF with 1000 pages; my job is to parse the PDF and store the data in a database.

If you want to try doing that with iText(Sharp), this thread on the iText mailing list may be of interest to you: [parse tabular data in PDF using iTextSharp](http://itext-general.2136553.n4.nabble.com/parse-tabular-data-in-PDF-using-iTextSharp-tt4657013.html). As @mark said in his answer, though, generic solutions are hit and miss. If your 1000 pages have very uniform tables a specially tailored extraction routine might be the best way to go. — mkl, Jan 15 '13 at 09:26
Possible duplicate of [Parsing PDF files (especially with tables) with PDFBox](https://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox) — beldaz, Oct 15 '17 at 21:21

score 4 · Answer 1 · edited Jan 15 '13 at 08:31

PDFs do not contain any table structure elements unless the file contains additional XML that defines the table; otherwise there is no structure. There is a blog article I wrote on how to find out.

Some tools, like PDFBox, will make an effort to guess the table, but it can be hit and miss.

edited Jan 15 '13 at 08:31 by Alexis Pigeon
answered Jan 15 '13 at 08:07 by mark stephens

Thanks for replying... But our problem is that we have a PDF file containing a record of examination results, which means the PDF has rows and columns. How can we parse that PDF using PDFBox and store the data in a database? – itsvks Jan 15 '13 at 14:37
@user1958037 have you meanwhile tried to use PdfBox as proposed by mark or iText along the lines of the mailing list thread I referred to? What problem have you run into? Furthermore, storing data in a database is a different matter altogether, what are your issues there? – mkl Jan 16 '13 at 09:48

score 1 · Answer 2 · answered Feb 18 '14 at 13:26

You can use this code to extract the data as a string:

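// pathToFile is a String path (PDFBox 1.x; in 2.x pass new File(pathToFile) instead)
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document); // text of all pages as one string
document.close(); // release the file once the text has been extracted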

Then you can use Java regular expressions to parse the text row by row and load the values into your Java POJO beans.
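
As a rough sketch of that regex step: the example below assumes each table row comes out of the text stripper as a single line of the form "roll number, name, marks" separated by whitespace. That layout, the `ResultRowParser` class, and the `ExamResult` fields are all hypothetical and only for illustration; the real pattern has to be adapted to what `getText` actually returns for your PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumed (hypothetical) row layout: "<roll number> <name> <marks>", e.g. "1023 Asha Verma 87".
class ResultRowParser {

    // Simple POJO for one table row (field names are illustrative).
    static final class ExamResult {
        final int rollNumber;
        final String name;
        final int marks;

        ExamResult(int rollNumber, String name, int marks) {
            this.rollNumber = rollNumber;
            this.name = name;
            this.marks = marks;
        }
    }

    // Leading digits, then a name that may contain spaces, then digits at the end of the line.
    private static final Pattern ROW = Pattern.compile("^(\\d+)\\s+(.+?)\\s+(\\d+)$");

    // Parses the extracted text line by line; lines that do not match the
    // pattern (column headers, page footers, blank lines) are skipped.
    static List<ExamResult> parse(String content) {
        List<ExamResult> results = new ArrayList<>();
        for (String line : content.split("\\R")) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                results.add(new ExamResult(
                        Integer.parseInt(m.group(1)),
                        m.group(2),
                        Integer.parseInt(m.group(3))));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(document).
        String content = "Roll No Name Marks\n1023 Asha Verma 87\n1024 Ravi Kumar 92\nPage 1\n";
        for (ExamResult r : parse(content)) {
            System.out.println(r.rollNumber + " | " + r.name + " | " + r.marks);
        }
    }
}
```

Each `ExamResult` can then be written to the database, for example with a JDBC `PreparedStatement`. Skipping non-matching lines keeps headers and footers out of the results, but it also silently drops rows that deviate from the assumed layout, so it may be worth logging the skipped lines while tuning the pattern.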
