PDF text extraction in Java

Question

I have a PDF file that was produced with iText and created with JasperReports (I don't know if that is relevant). Is there an API or tool that lets me see its structure? I need to extract text from it.

I tried iText, PDFBox and other Java libraries, but I only get the text line by line, and that's not what I need.
I also tried converting to HTML, XML and DOM, but I get the same result as with text extraction: no structure is parsed.
If I open it as DOCX, I see that Word recognizes some sort of structure. For example, an area that looks like a table in the PDF is an actual table after conversion to DOCX.

I need to understand how the PDF was created, if that is possible. I know that working with PDFs is not easy, but I need something useful to start with. Thanks!

Since you are not interested in the text / content line by line, can you provide more details on what kind of information or structure you are interested in? — U880D, Jul 11 '18 at 08:23
A typical PDF does not store any more information than its plain text "line by line" – and even *that* is not a requirement nor a guarantee. (One or two characters at a time, at any x and y position, is not unusual at all.) You may get lucky with your limited workflow … but do your research and verify with a PDF object inspector if your workflow indeed *does* store this meta-information. If it doesn't, then no tool can help you. — Jongware, Jul 11 '18 at 08:32
@U880D I have blocks of text divided by bold horizontal lines (3 blocks per page). The first block contains info that I am not interested in, the second has some info split into two columns, and the last has a sort of table with four columns and ~10 rows. That table is what I need: I want to extract it as a table structure and get the text from it. — Tobi, Jul 11 '18 at 08:38
@U880D Another problem: I tried to identify what I need by coordinates (I saw that the info is stored at the same coordinates in every PDF of this kind, and I wrote an algorithm for that), but if the producer of the PDF changes something (adding a new line, for example), my algorithm breaks... — Tobi, Jul 11 '18 at 08:49
You may start with this thread about [Structure of a PDF file](https://stackoverflow.com/questions/88582/structure-of-a-pdf-file). — U880D, Jul 11 '18 at 08:55
The Datalogics PDF Java Toolkit does an excellent job of inferring structure and extracting the text from PDF files that were created *without* structure into a List of Paragraph objects, which is a List of Sentence objects, composed of Word objects in reading order. You can use the Word objects to get the word coordinates as well. — joelgeraci, Jul 11 '18 at 15:11
Who is generating the PDF from JasperReports/iText? If it is you, do you not already have the structure info you are looking for? Essentially, it is unclear why you are trying to analyze a PDF, which is very difficult, if you already have the structure beforehand. — Ryan, Jul 11 '18 at 18:26
@Ryan, it is not generated by me, I download it from my bank account. It's a transaction report document. I only have the PDF and nothing else.... — Tobi, Jul 12 '18 at 14:48
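
The coordinate approach described in the comments breaks when absolute positions shift. A minimal sketch of one alternative, assuming a PDF library has already produced text fragments with positions: group fragments into rows by relative y distance, order cells by x, and locate the table by its header text instead of by fixed coordinates. The `Fragment` type, the `TableGrouper` class, the tolerance value, and the header name `"Date"` are all hypothetical, not taken from the bank's PDF or from any library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TableGrouper {

    /** Hypothetical holder for a piece of text and the position reported for it. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into rows. A fragment joins the current row when its y
     * differs from the row's first fragment by at most yTolerance.
     * Assumes y grows downward; reverse the y comparator otherwise.
     */
    static List<List<String>> toRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        // Within each row, order the cells left to right and keep only the text.
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    /** Returns the rows that follow the first row whose first cell equals firstHeaderCell. */
    static List<List<String>> rowsAfterHeader(List<List<String>> rows, String firstHeaderCell) {
        for (int i = 0; i < rows.size(); i++) {
            List<String> row = rows.get(i);
            if (!row.isEmpty() && row.get(0).equals(firstHeaderCell)) {
                return new ArrayList<>(rows.subList(i + 1, rows.size()));
            }
        }
        return new ArrayList<>(); // header not found
    }
}
```

For example, `TableGrouper.rowsAfterHeader(TableGrouper.toRows(fragments, 2.0), "Date")` would return only the data rows below a header row that starts with `Date`. If the producer inserts a line above the table, all table fragments shift by the same amount, so the grouping is unaffected; a renamed header or cells that wrap onto a second line would still need extra handling.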

score 1 · Answer 1 · answered Jul 11 '18 at 20:59

PDFTron PDFGenie can do full semantic table and paragraph extraction from a PDF file. It can generate a reflowable HTML file containing all the appropriate HTML tags for tables and paragraphs.

See this blog for more details. https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/#a-idpart7aevaluating-accuracy-of-pdf-table-recognition

You can download the PDFGenie command-line tool for Windows/macOS/Linux here. https://www.pdftron.com/downloads/linux
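
If a converter does emit table markup, as this answer says PDFGenie can, the rows can be read back with the JDK's built-in XML parser. A minimal sketch, assuming the converted file is well-formed XML (XHTML) with plain `table`, `tr`, `th`, and `td` elements; that is an assumption, not something confirmed for PDFGenie's output, and ordinary non-XML HTML would need an HTML parser instead. The `XhtmlTableReader` class is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XhtmlTableReader {

    /** Returns one list of cell texts per tr element, in document order. */
    static List<List<String>> readRows(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Do not fetch external DTDs referenced by a DOCTYPE.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                List<String> cells = new ArrayList<>();
                NodeList children = trs.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    String name = child.getNodeName();
                    // Keep only direct td/th children; skip whitespace text nodes.
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && (name.equals("td") || name.equals("th"))) {
                        cells.add(child.getTextContent().trim());
                    }
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }
}
```

Calling `XhtmlTableReader.readRows(...)` on the converted file's content yields a list of rows, each a list of cell strings, which can then be filtered down to the transaction table.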

Thank you. I will try this option too. — Tobi, Jul 12 '18 at 14:49

score 0 · Answer 2 · answered Jul 11 '18 at 08:18

Another option is Aspose.PDF, which can also extract text. If you are interested, see the link below.

https://blog.aspose.com/2018/02/28/extract-text-by-paragraphs-and-convert-files-to-pdf-with-aspose.pdf/

— srinivas, Jul 11 '18 at 08:18

Thank you. I used Aspose PDF to convert to DOCX (I forgot to mention it) because a DOCX helps me more than a PDF, but the conversion does not keep any structure. As I said, if I open it from Adobe Reader as DOCX, I get some formatting that is useful, but the conversion keeps none of it; there are just paragraphs. — Tobi, Jul 11 '18 at 08:54
